abali:ibm370:
Bülent Abali, Bruce D. Gavril, Richard W. Hadsell, Linh Lam, and Brion Shimamoto. Many/370: A parallel computer prototype for I/O intensive applications. In Proceedings of the Sixth Distributed Memory Computing Conference, pages 728-730, 1991.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Describes a parallel IBM/370, where they attach several small 370s to a switch, and several disks to each 370. Not much in the way of striping.

abawajy:scheduling:
J. H. Abawajy. Performance analysis of parallel I/O scheduling approaches on cluster computing systems. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 724-729, Tokyo, May 2003. Carleton University, IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: As computation and communication hardware performance continue to rapidly increase, I/O represents a growing fraction of application execution time. This gap between the I/O subsystem and others is expected to increase in the future, since I/O performance is limited by physical motion. Therefore, it is imperative that novel techniques for improving I/O performance be developed. Parallel I/O is a promising approach to alleviating this bottleneck. However, very little work exists with respect to scheduling parallel I/O operations explicitly. In this paper, we address the problem of effective management of parallel I/O in cluster computing systems by using appropriate I/O scheduling strategies. We propose two new I/O scheduling algorithms and compare them with two existing scheduling approaches. The preliminary results show that the proposed policies outperform existing policies substantially.

Keywords: parallel I/O, I/O scheduling algorithms, pario-bib

abello:dimacs:
James Abello and Jeffrey Scott Vitter, editors. External Memory Algorithms and Visualization. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society Press, Providence, RI, 1999.

Keywords: parallel I/O, out-of-core algorithm, pario-bib

Comment: See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.

abello:graph:
James Abello, Adam L. Buchsbaum, and Jeffrey R. Westbrook. A functional approach to external memory graph algorithms. In Proceedings of the 6th Annual European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 332-343, Venice, Italy, August 1998. Springer-Verlag.

Keywords: out-of-core algorithm, graph, pario-bib

abu-safah:speedup:
Walid Abu-Sufah, Harlan Husmann, and David Kuck. On Input/Output speed-up in tightly-coupled multiprocessors. IEEE Transactions on Computers, pages 520-530, 1986.

Keywords: parallel I/O, I/O, pario-bib

Comment: Derives formulas for the speedup with and without I/O considered and with parallel software and hardware format conversion. Considering I/O gives a more optimistic view of the speedup of a program assuming that the parallel version can use its I/O bandwidth as effectively as the serial processor. Concludes that, for a given number of processors, increasing the I/O bandwidth is the most effective way to speed up the program (over the format conversion improvements).
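
As a minimal illustration of the kind of formula derived here (my notation, not necessarily the paper's): if the serial program spends time $T_c$ computing and $T_{io}$ on I/O, and the parallel version speeds up computation by a factor $p$ and I/O by a factor $b$, the overall speedup is

\[ S(p,b) = \frac{T_c + T_{io}}{T_c/p + T_{io}/b}, \]

which approaches $b\,(T_c + T_{io})/T_{io}$ as $p$ grows. Adding processors alone thus hits an I/O-imposed ceiling, consistent with the comment's conclusion that raising I/O bandwidth is the most effective lever.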

acharya:tuning:
Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeffrey K. Hollingsworth, Joel Saltz, and Alan Sussman. Tuning the performance of I/O intensive parallel applications. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 15-27, Philadelphia, May 1996. ACM Press.

Abstract: Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four applications achieve application-level I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and knowledge of future I/O requests. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance.

Keywords: parallel I/O, filesystem workload, parallel application, pario-bib

aggarwal:sorting:
Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, September 1988.

Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring $P$ blocks each containing $B$ records in a single time unit; the records in each block must be input from or output to $B$ contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for $P=1$ that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case $B = P = O(1)$.

Keywords: parallel I/O, sorting, pario-bib

Comment: Good comments on typical external sorts, and how big they are. Focuses on parallelism at the disk. They give tight theoretical bounds on the number of I/Os required to do external sorting and other problems (FFTs, matrix transpose, etc.). If $x$ is the number of blocks in the file and $y$ is the number of blocks that fit in memory, then the number of I/Os needed grows as $\Theta(x \log x / \log y)$. If parallel transfers of $p$ blocks are allowed, speedup linear in $p$ is obtained.
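
In the variables of the abstract ($N$ records, block size $B$, memory size $M$, $P$ blocks per parallel transfer), this is the Aggarwal-Vitter sorting bound

\[ \mathrm{Sort}(N) = \Theta\!\left(\frac{N}{PB}\cdot\frac{\log(N/B)}{\log(M/B)}\right) \text{ I/Os}, \]

which for $P=1$ reduces to the comment's $\Theta(x \log x / \log y)$ with $x = N/B$ file blocks and $y = M/B$ memory-resident blocks.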

agrawal:asynch:
Gagan Agrawal, Anurag Acharya, and Joel Saltz. An interprocedural framework for placement of asynchronous I/O operations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 358-365, Philadelphia, PA, May 1996. ACM Press.

Keywords: compiler, I/O, pario-bib

Comment: Not really about parallel applications or parallel I/O, but I think it may be of interest to that community. They propose an interprocedural compiler framework that inserts asynchronous I/O operations (start I/O, finish I/O) while satisfying the dependence constraints of the program; see the sketch below.
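
A rough sketch of the placement idea, with POSIX AIO standing in for the start-I/O and finish-I/O runtime calls the compiler would emit (the paper works at the compiler level and does not prescribe this API; "data.bin" is a hypothetical input file):

    /* Start the read as early as dependences allow; wait only at the
       first true use of the buffer. Link with -lrt on Linux. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; } /* start I/O */

        /* ... computation that does not touch buf overlaps the read ... */

        const struct aiocb *pending[1] = { &cb };
        aio_suspend(pending, 1, NULL);   /* finish I/O: block at first use */
        printf("read %zd bytes\n", aio_return(&cb));
        close(fd);
        return 0;
    }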

aguilar:graph:
Jose Aguilar. A graph theoretical model for scheduling simultaneous I/O operations on parallel and distributed environments. Parallel Processing Letters, 12(1):113-126, March 2002.

Abstract: The motivation for the research presented here is to develop an approach for scheduling I/O operations in distributed/parallel computer systems. First, a general model for specifying the parallel I/O scheduling problem is developed. The model defines the I/O bandwidth for different parallel/distributed architectures. Then the model is used to establish an algorithm for scheduling I/O operations on these architectures.

Keywords: parallel I/O, scheduling, pario-bib

ali:enhancing:
Zeyad Ali and Qutaibah Malluhi. Enhancing data-intensive applications performance by tuning the distributed storage policies. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'04, volume 3, pages 1515-1522, Las Vegas, NV, June 2004.

Abstract: This paper describes the performance improvements achieved by a data-intensive application by controlling the storage policies and algorithms of a distributed storage system. The Network Storage Manager (NSM) is a mass distributed storage framework with a unique architecture that provides applications with the high-performance features they need. It also provides the standard, most commonly used implementations for storage policies. Distributed Terrain Viewer (DTViewer) is an application that utilizes the NSM architecture for efficient and reliable data delivery. Moreover, it exploits NSM's controllable architecture by plugging in its own application-specific optimized implementations. DTViewer overrides the default NSM policies, which do not understand its sophisticated access patterns, partitioning, and storage layout requirements. Experimental results show significant improvement when the application-tailored implementations are used. Such speedups are not achievable on storage systems with no application control, such as the Parallel Virtual File System (PVFS).

Keywords: application-specific storage policies, pario-app, DTViewer, access patterns, data layout, pario-bib

allcock:grid:
Bill Allcock, Joe Bester, John Bresnahan, Ann L. Chervenak, Ian Foster, Carl Kesselman, Sam Meder, Veronika Nefedova, Darcy Quesnel, and Steven Tuecke. Data management and transfer in high-performance computational grid environments. Parallel Computing, 28(5):749-771, May 2002.

Abstract: An emerging class of data-intensive applications involve the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data Grids provide essential infrastructure for such applications, much as the Internet provides essential services for applications such as e-mail and the Web. We describe here two services that we believe are fundamental to any Data Grid: reliable, high-speed transport and replica management. Our high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. Our replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. We present the design of both services and also preliminary performance results. Our implementations exploit security and other services provided by the Globus Toolkit.

Keywords: computational grid, data transfer, network, I/O, pario-bib

allen:cactus:
Gabrielle Allen, Tom Goodale, Joan Massó, and Edward Seidel. The Cactus computational toolkit and using distributed computing to collide neutron stars. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 57-61, Redondo Beach, CA, August 1999. IEEE Computer Society Press.

Abstract: We are developing a system for collaborative research and development for a distributed group of researchers at different institutions around the world. In a new paradigm for collaborative computational science, the computer code and supporting infrastructure itself becomes the collaborating instrument, just as an accelerator becomes the collaborating tool for large numbers of distributed researchers in particle physics. The design of this "Collaboratory" allows many users, with very different areas of expertise, to work coherently together, on distributed computers around the world. Different supercomputers may be used separately, or for problems exceeding the capacity of any single system, multiple supercomputers may be networked together through high speed gigabit networks. Central to this Collaboratory is a new type of community simulation code, called "Cactus". The scientific driving force behind this project is the simulation of Einstein's equations for studying black holes, gravitational waves, and neutron stars, which has brought together researchers in very different fields from many groups around the world to make advances in the study of relativity and astrophysics. But the system is also being developed to provide scientists and engineers, without expert knowledge of parallel or distributed computing, mesh refinement, and so on, with a simple framework for solving any system of partial differential equations on many parallel computer systems, from traditional supercomputers to networks of workstations.

Keywords: scientific application, grid, input/output, parallel-io, pario-bib

Comment: invited talk. They describe a computational toolkit (CACTUS) that allows developers to construct code modules (thorns) to plug into the core system (cactus flesh). The toolkit includes thorns for solving partial differential equations using MPI, parallel elliptic solvers, thorns for I/O using FlexIO or HDF5, and thorns for checkpointing. The talk showed results from a cactus code demo that ran at SC'98. The demo combined two tightly-connected supercomputers (one in Europe and one in America) using Globus to simulate the collision of two neutron stars.

alvarez:failures:
Guillermo A. Alvarez, Walter A. Burkhard, and Flaviu Cristian. Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 62-72. IEEE Computer Society Press, May 1997.
See also later version alvarez:bfailures.

Keywords: fault tolerance, RAID, disk array, parallel I/O, pario-bib

alvarez:jminerva:
Guillermo A. Alvarez, Elizabeth Borowsky, Susie Go, Theodore H. Romer, Ralph Becker-Szendy, Richard Golding, Arif Merchant, Mirjana Spasojevic, Alistair Veitch, and John Wilkes. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Transactions on Computer Systems, 19(4):483-518, November 2001.

Abstract: Enterprise-scale storage systems, which can contain hundreds of host computers and storage devices and up to tens of thousands of disks and logical volumes, are difficult to design. The volume of choices that need to be made is massive, and many choices have unforeseen interactions. Storage system design is tedious and complicated to do by hand, usually leading to solutions that are grossly over-provisioned, substantially under-performing or, in the worst case, both. To solve the configuration nightmare, we present Minerva: a suite of tools for designing storage systems automatically. Minerva uses declarative specifications of application requirements and device capabilities; constraint-based formulations of the various sub-problems; and optimization techniques to explore the search space of possible solutions. This paper also explores and evaluates the design decisions that went into Minerva, using specialized micro- and macro-benchmarks. We show that Minerva can successfully handle a workload with substantial complexity (a decision-support database benchmark). Minerva created a 16-disk design in only a few minutes that achieved the same performance as a 30-disk system manually designed by human experts. Of equal importance, Minerva was able to predict the resulting system's performance before it was built.

Keywords: disk array, storage system, RAID, automatic design, parallel I/O, pario-bib

alverson:tera:
Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Proceedings of the 1990 ACM International Conference on Supercomputing, pages 1-6, 1990.

Keywords: parallel architecture, MIMD, NUMA, pario-bib

Comment: Interesting architecture. 3-d mesh of pipelined packet-switch nodes, e.g., 16x16x16 is 4096 nodes, with 256 procs, 512 memory units, 256 I/O cache units, and 256 I/O processors attached. 2816 remaining nodes are just switching nodes. Each processor is 64-bit custom chip with up to 128 simultaneous threads in execution. It alternates between ready threads, with a deep pipeline. Inter-instruction dependencies explicitly encoded by the compiler, stalling those threads until the appropriate time. Each thread has a complete set of registers! Memory units have 4-bit tags on each word, for full/empty and trap bits. Shared memory across the network: ``The Tera ISP-level architecture is UMA, even though the PMS-level architecture is NUMA. Put another way, the memory looks a single cycle away to the compiler writer.'' - Burton Smith. See also tera:brochure.

anderson:bserverless:
Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe Matthews, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless network file systems. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 24, pages 364-385. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version anderson:serverless.

Keywords: file caching, distributed file system, pario-bib

Comment: Part of jin:io-book; reformatted version of anderson:serverless.

anderson:buttress:
Eric Anderson, Mahesh Kallahalla, Mustafa Uysal, and Ram Swaminathan. Buttress: A toolkit for flexible and high fidelity I/O benchmarking. In Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies, pages 45-58, San Francisco, CA, March 2004. Hewlett-Packard Laboratories, USENIX Association.

Abstract: In benchmarking I/O systems, it is important to accurately generate the I/O access pattern that one intends to generate. However, timing accuracy (issuing I/Os at the desired time) at high I/O rates is difficult to achieve on stock operating systems. We currently lack tools to easily and accurately generate complex I/O workloads on modern storage systems. As a result, we may be introducing substantial errors in observed system metrics when we benchmark I/O systems using inaccurate tools for replaying traces or for producing synthetic workloads with known inter-arrival times.

In this paper, we demonstrate the need for timing accuracy for I/O benchmarking in the context of replaying I/O traces. We also quantitatively characterize the impact of error in issuing I/Os on measured system parameters. For instance, we show that the error in perceived I/O response times can be as much as +350% or -15% by using naive benchmarking tools that have timing inaccuracies. To address this problem, we present Buttress, a portable and flexible toolkit that can generate I/O workloads with microsecond accuracy at the I/O throughputs of high-end enterprise storage arrays. In particular, Buttress can issue I/O requests within 100µs of the desired issue time even at rates of 10000 I/Os per second (IOPS).

Keywords: benchmarking software, performance analysis, I/O access patterns, I/O workloads, pario-bib

Comment: Looks like a really cool piece of software. Generates I/O workloads by replaying I/O traces.
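
The fidelity problem reduces to issuing each request at its scheduled instant; stock sleep primitives can overshoot by milliseconds, yet at 10000 IOPS the inter-issue gap is only 100µs. A minimal sketch of the usual remedy (coarse sleep, then spin), purely illustrative and not Buttress's actual code:

    #include <stdint.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }

    /* Sleep until ~200µs before the deadline, then busy-wait the rest;
       the spin burns CPU but lands within a few microseconds. */
    static void wait_until(uint64_t deadline_ns)
    {
        while (now_ns() + 200000 < deadline_ns) {
            struct timespec nap = { 0, 50000 }; /* 50µs */
            nanosleep(&nap, NULL);
        }
        while (now_ns() < deadline_ns)
            ; /* spin */
    }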

anderson:raid:
Eric Anderson, Ram Swaminathan, Alistair Veitch, Guillermo A. Alvarez, and John Wilkes. Selecting RAID levels for disk arrays. In Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies, pages 189-202, Monterey, CA, January 2002. USENIX Association.

Abstract: Disk arrays have a myriad of configuration parameters that interact in counter-intuitive ways, and those interactions can have significant impacts on cost, performance, and reliability. Even after values for these parameters have been chosen, there are exponentially-many ways to map data onto the disk arrays' logical units. Meanwhile, the importance of correct choices is increasing: storage systems represent a growing fraction of total system cost, they need to respond more rapidly to changing needs, and there is less and less tolerance for mistakes. We believe that automatic design and configuration of storage systems is the only viable solution to these issues. To that end, we present a comparative study of a range of techniques for programmatically choosing the RAID levels to use in a disk array. Our simplest approaches are modeled on existing, manual rules of thumb: they "tag" data with a RAID level before determining the configuration of the array to which it is assigned. Our best approach simultaneously determines the RAID levels for the data, the array configuration, and the layout of data on that array. It operates as an optimization process with the twin goals of minimizing array cost while ensuring that storage workload performance requirements will be met. This approach produces robust solutions with an average cost/performance 14-17% better than the best results for the tagging schemes, and up to 150-200% better than their worst solutions. We believe that this is the first presentation and systematic analysis of a variety of novel, fully-automatic RAID-level selection techniques.

Keywords: file systems, pario-bib

anderson:serverless:
Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless network file systems. ACM Transactions on Computer Systems, 14(1):41-79, February 1996.
See also later version anderson:bserverless.

Keywords: file caching, distributed file system, pario-bib

Comment: See anderson:serverless-sosp.

ap:enwrich:
Apratim Purakayastha, Carla Schlatter Ellis, and David Kotz. ENWRICH: a compute-processor write caching scheme for parallel file systems. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 55-68, Philadelphia, May 1996. ACM Press.
See also earlier version ap:enwrich-tr.

Abstract: Many parallel scientific applications need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance has not been able to keep up with substantial improvements in parallel-I/O hardware because of poor parallel file-system software. Many radical changes, both at the interface level and the implementation level, have recently been proposed. One such proposed interface is collective I/O, which allows parallel jobs to request transfer of large contiguous objects in a single request, thereby preserving useful semantic information that would otherwise be lost if the transfer were expressed as per-processor non-contiguous requests. Kotz has proposed disk-directed I/O as an efficient implementation technique for collective-I/O operations, where the compute processors make a single collective data-transfer request, and the I/O processors thereafter take full control of the actual data transfer, exploiting their detailed knowledge of the disk-layout to attain substantially improved performance.

Recent parallel file-system usage studies show that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In this paper, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design via simulated implementation and show that ENWRICH achieves high performance for various configurations and workloads.

Keywords: parallel file system, parallel I/O, caching, pario-bib, dfk

ap:enwrich-tr:
Apratim Purakayastha, Carla Schlatter Ellis, and David Kotz. ENWRICH: a compute-processor write caching scheme for parallel file systems. Technical Report CS-1995-22, Dept. of Computer Science, Duke University, October 1995.
See also later version ap:enwrich.

Abstract: Many parallel scientific applications need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance has not been able to keep up with substantial improvements in parallel-I/O hardware because of poor parallel file-system software. Many radical changes, both at the interface level and the implementation level, have recently been proposed. One such proposed interface is collective I/O, which allows parallel jobs to request transfer of large contiguous objects in a single request, thereby preserving useful semantic information that would otherwise be lost if the transfer were expressed as per-processor non-contiguous requests. Kotz has proposed disk-directed I/O as an efficient implementation technique for collective-I/O operations, where the compute processors make a single collective data-transfer request, and the I/O processors thereafter take full control of the actual data transfer, exploiting their detailed knowledge of the disk-layout to attain substantially improved performance.

Recent parallel file-system usage studies show that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In this paper, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design via simulated implementation and show that ENWRICH achieves high performance for various configurations and workloads.

Keywords: parallel file system, parallel I/O, caching, pario-bib, dfk

ap:thesis:
Apratim Purakayastha. Characterizing and Optimizing Parallel File Systems. PhD thesis, Dept. of Computer Science, Duke University, Durham, NC, June 1996. Also available as technical report CS-1996-10.

Abstract: High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. In the first part of this dissertation, we attempt to fill this void by measuring a real file-system workload on a production parallel machine, namely the CM-5 at the National Center for Supercomputing Applications. We collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for design of future parallel file systems. Our usage study showed that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In the second part of this dissertation, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. Within its framework, ENWRICH uses a recently proposed high performance implementation of collective I/O operations called disk-directed I/O, but it eliminates a number of limitations of disk-directed I/O. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface, and without the requirement for mapping libraries at the I/O processors. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design of ENWRICH using simulated implementation and extensive experimentation. We show that ENWRICH achieves high performance for various configurations and workloads. We pinpoint the reasons for ENWRICH's failure to perform well for certain workloads, and suggest possible enhancements. Finally, we discuss the nuances of implementing ENWRICH on a real platform and speculate about possible adaptations of ENWRICH for emerging multiprocessing platforms.

Keywords: parallel I/O, multiprocessor file system, file access patterns, workload characterization, file caching, disk-directed I/O, pario-bib

Comment: See also ap:enwrich, ap:workload, and nieuwejaar:workload

ap:workload:
Apratim Purakayastha, Carla Schlatter Ellis, David Kotz, Nils Nieuwejaar, and Michael Best. Characterizing parallel file-access patterns on a large-scale multiprocessor. In Proceedings of the Ninth International Parallel Processing Symposium, pages 165-172. IEEE Computer Society Press, April 1995.
See also earlier version ap:workload-tr.
See also later version nieuwejaar:workload-tr.

Abstract: High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, here we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.

Keywords: parallel I/O, file access pattern, multiprocessor file system, file system workload, dfk, pario-bib

Comment: See also kotz:workload, nieuwejaar:strided.

ap:workload-tr:
Apratim Purakayastha, Carla Schlatter Ellis, David Kotz, Nils Nieuwejaar, and Michael Best. Characterizing parallel file-access patterns on a large-scale multiprocessor. Technical Report CS-1994-33, Dept. of Computer Science, Duke University, October 1994.
See also later version ap:workload.

Abstract: Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors.

Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines.

The results of our trace analysis lead to recommendations for parallel file system design. First, the file system should support efficient concurrent access to many files, and I/O requests from many jobs under varying load conditions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests, predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs, appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.

Keywords: parallel I/O, file access pattern, multiprocessor file system, file system workload, dfk, pario-bib

Comment: See also kotz:workload, nieuwejaar:strided.

arendt:genome:
James W. Arendt. Parallel genome sequence comparison using a concurrent file system. Technical Report UIUCDCS-R-91-1674, University of Illinois at Urbana-Champaign, 1991.

Keywords: parallel file system, parallel I/O, Intel iPSC/2, pario-bib

Comment: Studies the performance of Intel CFS. Uses an application that reads in a huge file of records, each a genome sequence, and compares each sequence against a given sequence. Looks at cache performance, message latency, cost of prefetches and directory reads, and throughput. He tries one-disk, one-proc transfer rates. Because of contention with the directory server on one of the two I/O nodes, it was faster to put all of the file on the other I/O node. Striping is good for multiple readers. Best access pattern was interleaved, not segmented or separate files, because it avoided disk seeks. Perhaps the files are stored contiguously? Can get good speedup by reading the sequences in big integral record sizes, from CFS, using a load-balancing scheduled by the host. Contention for directory blocks - through single-node directory server.

arge:GIS:
Lars Arge. External-memory algorithms with applications in GIS. In Marc van Kreveld, Jürg Nievergelt, Thomas Roos, and Peter Widmayer, editors, Algorithmic Foundations of Geographic Information Systems, volume 1340 of Lecture Notes in Computer Science, pages 213-254. Springer-Verlag, 1997.

Abstract: The paper presents a survey of the basic paradigms for designing efficient external-memory algorithms and especially for designing external-memory algorithms for computational geometry problems with applications in GIS. As the area of external-memory algorithms is relatively young the paper focuses on fundamental external-memory design techniques more than on algorithms for specific GIS problems. The presentation is survey-like with a more detailed discussion of the most important techniques and algorithms.

Keywords: out-of-core algorithm, geographic information system, GIS, pario-bib

Comment: Not parallel? But mentions some parallel disk stuff.

arge:jsegments:
Lars Arge, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory algorithms for processing line segments in geographic information systems. Algorithmica, 1998. To appear.
See also earlier version arge:segments.

Abstract: We present a set of algorithms designed to solve large-scale geometric problems involving collections of line segments in the plane. Geographical information systems (GIS) handle large amounts of spatial data, and at some level the data is often manipulated as collections of line segments. NASA's EOS project is an example of a GIS that is expected to store and manipulate petabytes (thousands of terabytes, or millions of gigabytes) of data! In the design of algorithms for this type of large-scale application, it is essential to consider the problem of minimizing I/O communication, which is the bottleneck.

In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them. To solve these problems, we combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. We also develop a powerful new technique that can be regarded as a practical external memory version of fractional cascading. Except for the batched planar point location problem, no algorithms specifically designed for external memory were previously known for these problems. Our algorithms for triangulation and line segment intersection partially answer previously posed open problems, while the batched planar point location algorithm improves on the previously known solution, which applied only to monotone decompositions. Our algorithm for the red-blue line segment intersection problem is provably optimal.

Keywords: verify, out-of-core algorithm, computational geometry, pario-bib

Comment: Special issue on cartography and geographic information systems.

arge:lower:
Lars Arge and Peter Bro Miltersen. On showing lower bounds for external-memory computational geometry problems. In Abello and Vitter [abello:dimacs], pages 139-160.

Keywords: out-of-core algorithm, computational geometry, pario-bib

Comment: See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.

arge:segments:
Lars Arge, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory algorithms for processing line segments in geographic information systems. In Proceedings of the Third European Symposium on Algorithms, volume 979 of Lecture Notes in Computer Science, pages 295-310, Corfu, Greece, September 1995. Springer-Verlag.
See also later version arge:jsegments.

Abstract: In the design of algorithms for large-scale applications it is essential to consider the problem of minimizing I/O communication. Geographical information systems (GIS) are good examples of such large-scale applications as they frequently handle huge amounts of spatial data. In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them.

Keywords: out-of-core algorithm, computational geometry, pario-bib

Comment: Does deal with parallel disks, though not in great detail.

arge:sorting:
Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeffrey Scott Vitter. Sequence sorting in secondary storage. In Proceedings of Compression and Complexity of Sequences, pages 329-346, Salerno, Italy, June 1998. IEEE Computer Society Press.

Abstract: We investigate the I/O complexity of the problem of sorting sequences (or strings of characters) in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is $\Theta(K \log_2 K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting sequences is $\Theta((K/B) \log_{M/B}(K/B) + N/B)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where the strings are not allowed to be broken into their individual characters, and we show that the I/O complexity of string sorting in this model is $\Theta((N_1/B) \log_{M/B}(N_1/B) + K_2 + N/B)$, where $N_1$ is the total length of all strings shorter than B and $K_2$ is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms outside the comparison model.

Keywords: out-of-core algorithm, sorting algorithm, pario-bib

Comment: This paper is really the same paper as arge:sorting-strings.

arge:sorting-strings:
Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeffrey Scott Vitter. On sorting strings in external memory. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 540-548, El Paso, May 1997. ACM Press.

Abstract: In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is $\Theta(K \log K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is $\Theta((K/B) \log_{M/B}(K/B) + N/B)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is $\Theta((N_1/B) \log_{M/B}(N_1/B) + K_2 \log_{M/B} K_2 + N/B)$, where $N_1$ is the total length of all strings shorter than B and $K_2$ is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.

Keywords: out-of-core algorithm, sorting, parallel I/O, pario-bib

Comment: Not parallel? But mentions some parallel disk stuff.

armen:disk-model:
Chris Armen. Bounds on the separation of two parallel disk models. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 122-127, Philadelphia, May 1996. ACM Press.

Abstract: The single-disk, D-head model of parallel I/O was introduced by Aggarwal and Vitter to analyze algorithms for problem instances that are too large to fit in primary memory. Subsequently Vitter and Shriver proposed a more realistic model in which the disk space is partitioned into D disks, with a single head per disk. To date, each problem for which there is a known optimal algorithm for both models has the same asymptotic bounds on both models. Therefore, it has been unknown whether the models are equivalent or whether the single-disk model is strictly more powerful.

In this paper we provide evidence that the single-disk model is strictly more powerful. We prove a lower bound on any general simulation of the single-disk model on the multi-disk model and establish randomized and deterministic upper bounds. Let $N$ be the problem size and let $T$ be the number of parallel I/Os required by a program on the single-disk model. Then any simulation of this program on the multi-disk model will require $\Omega\left(T \frac{\log(N/D)}{\log \log(N/D)}\right)$ parallel I/Os. This lower bound holds even if replication is allowed in the multi-disk model. We also show an $O\left(\frac{\log D}{\log \log D}\right)$ randomized upper bound and an $O\left(\log D (\log \log D)^2\right)$ deterministic upper bound. These results exploit an interesting analogy between the disk models and the PRAM and DCM models of parallel computation.

Keywords: parallel I/O, theory, parallel I/O algorithm, pario-bib

arpaci-dusseau:jriver:
Remzi H. Arpaci-Dusseau. Run-time adaptation in River. ACM Transactions on Computer Systems, 21(1):36-86, February 2003.

Keywords: distributed query processing, dataflow, pario-bib

Comment: River is a dataflow programming environment for database query processing applications. River is specifically designed for clusters of computers with heterogeneous performance characteristics. The goal of the River runtime system is to adapt to "performance faults" (portions of the system that perform poorly) by dynamically adjusting the transfer of data through the dataflow graph. River uses two constructs to build applications: a distributed queue that deals with performance faults of consumers, and graduated declustering that deals with performance faults of producers. A distributed queue pushes data through the dataflow graph at a rate proportional to the rate of consumption and adapts to changes in consumption rates. Graduated declustering deals with producer performance faults by reading from replicated producers. Although River is designed specifically for query processing, they briefly discuss how one might adapt scientific applications to work in their framework.

arpaci-dusseau:river:
Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 10-22, Atlanta, GA, May 1999. ACM Press.

Abstract: We introduce River, a data-flow programming environment and I/O substrate for clusters of computers. River is designed to provide maximum performance in the common case, even in the face of non-uniformities in hardware, software, and workload. River is based on two simple design features: a high-performance distributed queue, and a storage redundancy mechanism called graduated declustering. We have implemented a number of data-intensive applications on River, which validate our design with near-ideal performance in a variety of non-uniform performance scenarios.

Keywords: cluster computing, parallel I/O, pario-bib

arunachalam:prefetch:
Meenakshi Arunachalam, Alok Choudhary, and Brad Rullman. A prefetching prototype for the parallel file system on the Paragon. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 321-323, May 1995. Extended Abstract.

Keywords: parallel I/O, prefetching, parallel file system, pario-bib

Comment: A related paper is arunachalam:prefetch2.

arunachalam:prefetch2:
Meenakshi Arunachalam, Alok Choudhary, and Brad Rullman. Implementation and evaluation of prefetching in the Intel Paragon Parallel File System. In Proceedings of the Tenth International Parallel Processing Symposium, pages 554-559, April 1996.

Abstract: The significant difference between the speeds of the I/O system (e.g., disks) and compute processors in parallel systems creates a bottleneck that lowers the performance of an application that does a considerable amount of disk accesses. A major portion of the compute processors' time is wasted on waiting for I/O to complete. This problem can be addressed to a certain extent if the necessary data can be fetched from the disk before the I/O call to the disk is issued. Fetching data ahead of time, known as prefetching, in a multiprocessor environment depends a great deal on the application's access pattern. The subject of this paper is implementation and performance evaluation of a prefetching prototype in a production parallel file system on the Intel Paragon. Specifically, this paper presents a) design and implementation of a prefetching strategy in the parallel file system and b) performance measurements and evaluation of the file system with and without prefetching. The prototype is designed at the operating system level for the PFS. It is implemented in the PFS subsystem of the Intel Paragon Operating System. It is observed that in many cases prefetching provides considerable performance improvements. In some other cases no improvements or some performance degradation is observed due to the overheads incurred in prefetching.

Keywords: parallel I/O, prefetching, multiprocessor file system, pario-bib

Comment: See arunachalam:prefetch.
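
Stripped to its essence, the policy the abstract describes is sequential-run detection followed by read-ahead. A hedged sketch of that idea (illustrative only, not the Paragon PFS code):

    /* Per-open-file prefetch state: after two consecutive block requests,
       predict the run will continue and fetch the next block early. */
    typedef struct {
        long last_block;  /* last block the application requested */
        int  run_length;  /* consecutive sequential requests seen so far */
    } prefetch_state;

    /* Called on every block request; returns a block number to prefetch,
       or -1 when the pattern does not look sequential. */
    static long on_request(prefetch_state *st, long block)
    {
        st->run_length = (block == st->last_block + 1) ? st->run_length + 1 : 0;
        st->last_block = block;
        return (st->run_length >= 1) ? block + 1 : -1;
    }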

asami:bself:
Satoshi Asami, Nisha Talagala, and David A. Patterson. Designing a self-maintaining storage system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 30, pages 453-463. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version asami:self.

Keywords: parallel I/O, disk array, RAID, pario-bib

Comment: Part of jin:io-book; reformatted version of asami:self.

asami:self:
Satoshi Asami, Nisha Talagala, and David A. Patterson. Designing a self-maintaining storage system. In Proceedings of the Sixteenth IEEE Symposium on Mass Storage Systems, pages 222-233. IEEE Computer Society Press, March 1999.
See also later version asami:bself.

Abstract: This paper shows the suitability of a self-maintaining approach to Tertiary Disk, a large-scale disk array system built from commodity components. Instead of incurring the cost of custom hardware, we attempt to solve various problems by design and software. We have built a cluster of storage nodes connected by switched Ethernet. Each storage node is a PC hosting a few dozen SCSI disks, running the FreeBSD operating system. The system is used as a web-based image server for the Zoom Project in cooperation with the Fine Arts Museums of San Francisco (http://www.thinker.org/). We are designing self-maintenance extensions to the OS to run on this cluster to mitigate the system administrator's burden. There are several components required for building a self-maintaining system. One is decoupling the time of failure from the time of hardware replacement. This implies the system must have some amount of redundancy and have no single point of failure. Our system is fully redundant, and everything is constructed to avoid a single point of failure. Another is correctly identifying failures and their dependencies. The paper also outlines several approaches to lower the human cost of system administration of such a system and to make the system as autonomous as possible.

Keywords: parallel I/O, disk array, RAID, pario-bib

asbury:fortranio:
Raymond K. Asbury and David S. Scott. FORTRAN I/O on the iPSC/2: Is there read after write? In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 129-132, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: parallel I/O, hypercube, Intel iPSC/2, file access pattern, pario-bib

asthana:active:
Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski. An experimental active memory based I/O subsystem. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 73-84. AT&T Bell Labs, April 1994. Also appeared in Computer Architecture News 22(4).
See also later version asthana:active-book.

Keywords: parallel I/O, architecture, pario-bib

Comment: They describe an I/O subsystem based on an ``active memory'' called SWIM (Structured Wafer-based Intelligent Memory). SWIM chips are RAM chips with some built-in processing. The idea is that these tiny processors can manipulate the data in the chip at full speed, without dealing with memory bus or off-chip costs. Further, the chips can work in parallel. They demonstrate how they've used this to build a national phone database server, a high-performance IP router, and a call-screening agent.

asthana:active-book:
Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski. An experimental memory-based I/O subsystem. In Jain et al. [iopads-book], chapter 17, pages 373-390.
See also earlier version asthana:active.

Abstract: We describe an I/O subsystem based on an active memory named SWIM (Structured Wafer-based Intelligent Memory) designed for efficient storage and manipulation of data structures. The key architectural idea in SWIM is to associate some processing logic with each memory chip that allows it to perform data manipulation operations locally and to communicate with a disk or a communication line through a backend port. The processing logic is specially designed to perform operations such as pointer dereferencing, memory indirection, searching and bounds checking efficiently. The I/O subsystem is built using an interconnected ensemble of such memory logic pairs. A complex processing task can now be distributed between a large number of small memory processors each doing a sub-task, while still retaining a common locus of control in the host CPU for higher level administrative and provisioning functions. We argue that active memory based processing enables more powerful, scalable and robust designs for storage and communications subsystems, that can support emerging network services, multimedia workstations and wireless PCS systems. A complete parallel hardware and software system constructed using an array of SWIM elements has been operational for over a year. We present results from application of SWIM to three network functions: a national phone database server, a high performance IP router, and a call screening agent.

Keywords: parallel I/O architecture, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

avalani:channels:
Bhavan Avalani, Alok Choudhary, Ian Foster, and Rakesh Krishnaiyer. Integrating task and data parallelism using parallel I/O techniques. In Proceedings of the International Workshop on Parallel Processing, Bangalore, India, December 1994.

Keywords: parallel I/O, pario-bib

Comment: They describe using the techniques of delrosario and debenedictis (although without mentioning them) to provide for channels (parallel pipes) between independent data-parallel tasks. The technique really is the same as in debenedictis and delrosario, although they extend it a bit to allow multiple "files" within a channel (why not use multiple channels?). Also, they depend on the program to read and write synchronization variables to control access to the flow of data through the channel. While this may provide good performance in some cases, why not have support for automatic flow control? The system can detect when a portion of the channel is written, and release readers waiting on that portion of the channel (if any). The paper is a bit confusing in its use of the word "file", which seems to be used to mean different things at different points. Also, they seem to use an arbitrary distribution for the "file", which may or may not be the same as one of those used by the two endpoints.

baer:grid-io:
Troy Baer and Pete Wyckoff. A parallel I/O mechanism for distributed systems. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 63-69, San Diego, CA, September 2004. IEEE Computer Society Press.

Abstract: Access to shared data is critical to the long term success of grids of distributed systems. As more parallel applications are being used on these grids, the need for some kind of parallel I/O facility across distributed systems increases. However, grid middleware has thus far had only limited support for distributed parallel I/O. In this paper, we present an implementation of the MPI-2 I/O interface using the Globus GridFTP client API. MPI is widely used for parallel computing, and its I/O interface maps onto a large variety of storage systems. The limitations of using GridFTP as an MPI-I/O transport mechanism are described, as well as support for parallel access to scientific data formats such as HDF and NetCDF. We compare the performance of GridFTP to that of NFS on the same network using several parallel I/O benchmarks. Our tests indicate that GridFTP can be a workable transport for parallel I/O, particularly for distributed read-only access to shared data sets.

Keywords: grid I/O, MPI-I/O, grid middleware, gridFTP, pario-bib
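
For context, the MPI-2 I/O interface the paper implements looks like the following from the application side; a standard collective read is shown, with the GridFTP transport hidden beneath it ("shared.dat" is a placeholder dataset name):

    /* Each rank reads its own contiguous slice; the _all suffix makes the
       call collective, so the I/O layer can coalesce requests. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        const int count = 1024;
        double *buf = malloc(count * sizeof(double));
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_read_at_all(fh, off, buf, count, MPI_DOUBLE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }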

bagrodia:sio-character:
Rajive Bagrodia, Andrew Chien, Yarsun Hsu, and Daniel Reed. Input/output: Instrumentation, characterization, modeling and management policy. Technical Report CCSF-41, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.

Keywords: parallel I/O, pario-bib, prefetching, caching, multiprocessor file system, file access pattern

Comment: Basically there are two parts to this paper. First, they will instrument applications, Intel PFS, and IBM Vesta, to trace I/O-related activity. Then they'll use Pablo to analyze and characterize. They plan to trace some events in detail, and the rest with histogram counters. Second, they plan to develop caching and prefetching policies and to analyze those with simulation, analysis, and implementation. They note that IBM and Intel are developing parallel I/O architecture simulators. See also poole:sio-survey, choudhary:sio-language, bershad:sio-os.

bairavasundaram:x-ray:
Lakshmi N. Bairavasundaram, Muthian Sivathanu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. X-RAY: A non-invasive exclusive caching mechanism for RAIDs. In Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 176-187, Munich, Germany, June 2004. IEEE Computer Society Press.

Abstract: RAID storage arrays often possess gigabytes of RAM for caching disk blocks. Currently, most RAID systems use LRU or LRU-like policies to manage these caches. Since these array caches do not recognize the presence of file system buffer caches, they redundantly retain many of the same blocks as those cached by the file system, thereby wasting precious cache space. In this paper, we introduce X-RAY, an exclusive RAID array caching mechanism. X-RAY achieves a high degree of (but not perfect) exclusivity through gray-box methods: by observing which files have been accessed through updates to file system meta-data, X-RAY constructs an approximate image of the contents of the file system cache and uses that information to determine the exclusive set of blocks that should be cached by the array. We use microbenchmarks to demonstrate that X-RAY's prediction of the file system buffer cache contents is highly accurate, and trace-based simulation to show that X-RAY considerably outperforms LRU and performs as well as other more invasive approaches. The main strength of the X-RAY approach is that it is easy to deploy - all performance gains are achieved without changes to the SCSI protocol or the file system above.

Keywords: RAID, x-ray, caching policies, pario-bib
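
Code sketch: my reading of the gray-box mechanism in the abstract, not the authors' implementation; the LRU image of the client cache and all names are assumptions. The array declines to cache blocks its image says the file system already holds.

    from collections import OrderedDict

    class ExclusiveArrayCache:
        """Toy exclusive array cache: skip blocks believed to be in the
        client's file system buffer cache, approximated by an LRU image
        built from observed accesses to file system meta-data."""

        def __init__(self, capacity, client_capacity):
            self.cache = OrderedDict()          # blocks held by the array
            self.capacity = capacity
            self.client_image = OrderedDict()   # approximate client cache
            self.client_capacity = client_capacity

        def observe_access(self, block):
            # Gray-box inference: an accessed block is now in the client cache.
            self.client_image.pop(block, None)
            self.client_image[block] = True
            while len(self.client_image) > self.client_capacity:
                self.client_image.popitem(last=False)

        def admit(self, block, data):
            if block in self.client_image:
                return                          # exclusivity: client has it
            self.cache[block] = data
            while len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the array-side LRU block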

baird:disa:
R. Baird, S. Karamooz, and H. Vazire. Distributed information storage architecture. In Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems, pages 145-155, 1993.

Keywords: parallel I/O, distributed file system, mass storage, pario-bib

Comment: Architecture for distributed information storage. Integrates file systems, databases, etc. Single system image, lots of support for administration. O-O model, with storage device objects, logical device objects, volume objects, and file objects. Methods for each type of object, including administrative methods.

bakker:semantic:
J.A. Bakker. Semantic partitioning as a basis for parallel I/O in database management systems. Parallel Computing, 26(11):1491-1513, October 2000.

Abstract: Modern applications such as `video on demand' require fast reading of complete files, which can be supported well by file striping. Many conventional applications, however, are only interested in some part of the available records. In order to avoid reading attributes irrelevant to such applications, each attribute could be stored in a separate (transposed) file. Aiming at I/O parallelism, byte-oriented striping could be applied to transposed files. However, such a fragmentation ignores the semantics of data. This fragmentation cannot be optimized by a database management system (DBMS) because a DBMS has to perform its tasks on the basis of data semantics. For example, queries must be translated into file operations using a scheme that maps a data model to a file system. However, details about files, such as the striping width, are invisible to a DBMS. Therefore, we propose to store each transposed file related to a composite type on a separate, independent disk drive, which means I/O parallelism tuned to a data model. As we also aim at system reliability and data availability, each transposed file must be duplicated on another drive. Consequently, a DBMS also has to guarantee correctness and completeness of the allocation of transposed files within an array of disk drives. As a solution independent of the underlying data model, we propose an abstract framework consisting of a meta model and a set of rules.

Keywords: database, parallel I/O, pario-bib

baldwin:hyperfs:
C. H. Baldwin and W. C. Nestlerode. A large scale file processing application on a hypercube. In Proceedings of the Fifth Annual Distributed-Memory Computer Conference, pages 1400-1404, 1990.

Keywords: multiprocessor file system, file access pattern, parallel I/O, hypercube, pario-bib

Comment: Census-data processing on an nCUBE/10 at USC. Their program uses an interleaved pattern, which is like my lfp or gw with multi-record records (i.e., the application does its own blocking). Shifted to asynchronous I/O to do OBL manually. Better results if they did more computation per I/O (of course).
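
Code sketch: a generic double-buffered read loop illustrating the manual one-block lookahead described above (not the application's code; the thread-pool stand-in for asynchronous I/O is an assumption).

    from concurrent.futures import ThreadPoolExecutor

    def process_file(f, blocksize, compute):
        """Overlap the read of block i+1 with the computation on block i."""
        with ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(f.read, blocksize)      # prime the pipeline
            while True:
                block = pending.result()                # wait for the prefetched block
                if not block:
                    break
                pending = io.submit(f.read, blocksize)  # issue the next read...
                compute(block)                          # ...while computing on this one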

baptist:fft:
Lauren M. Baptist. Two algorithms for performing multidimensional, multiprocessor, out-of-core FFTs. Technical Report PCS-TR99-350, Dept. of Computer Science, Dartmouth College, Hanover, NH, June 1999.

Abstract: We show two algorithms for computing multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis.

The first method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the data for the 1-dimensional FFTs are contiguous.

The second method is an adaptation of another well-known method for in-core, multidimensional FFTs. This method computes all dimensions simultaneously. It is more difficult to generalize to arbitrary radices and number of dimensions in this method than in the first method. Our present implementation is therefore limited to two dimensions of equal size, that are again integer powers of 2.

We present I/O complexity analyses for both methods as well as empirical results for a DEC 2100 server and an SGI Origin 2000, each of which has a parallel disk system. Our results indicate that the methods are comparable in speed in two dimensions.

Keywords: parallel I/O, out of core, FFT, parallel algorithm, scientific computing, pario-bib

Comment: Undergraduate Honors Thesis. Advisor: Tom Cormen.
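
Code sketch: a toy in-core rendition of the first method, with numpy arrays standing in for the parallel disk system and the slab-at-a-time I/O elided. The structure is the same: 1-dimensional FFTs along one dimension, a transpose to make the next dimension contiguous, then 1-dimensional FFTs again.

    import numpy as np

    def ooc_fft2(a):
        """Dimension-at-a-time 2-D FFT. In the out-of-core version each
        row is a block read from and written to the parallel disks, and
        the transposes are out-of-core transpose operations."""
        a = a.astype(complex)              # FFT results are complex
        for i in range(a.shape[0]):        # pass 1: FFT each row
            a[i, :] = np.fft.fft(a[i, :])
        a = a.T.copy()                     # stand-in for the out-of-core transpose
        for i in range(a.shape[0]):        # pass 2: FFT each (former) column
            a[i, :] = np.fft.fft(a[i, :])
        return a.T.copy()                  # restore the original layout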

barak:hfs:
Amnon Barak, Bernard A. Galler, and Yaron Farber. A holographic file system for a multicomputer with many disk nodes. Technical Report 88-6, Dept. of Computer Science, Hebrew University of Jerusalem, May 1988.

Keywords: parallel I/O, hashing, reliability, disk mirroring, pario-bib

Comment: Describes a file system for a distributed system that scatters the records of each file over many disks using hash functions. The hash function is known by all processors, so access to a file does not depend on any single processor being up; any portion of the file whose disk node is available may be accessed (see the sketch below). Shadow nodes take over for nodes that go down, saving the information for later use by the proper node. Intended to parallelize read/write accesses and global file operations easily, and to increase file availability.
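
Code sketch: a minimal rendition of the record-scattering idea (the hash choice and the shadow-selection rule are my assumptions, not the paper's).

    import hashlib

    def disknode(file_id, record_no, nnodes):
        """Map a (file, record) pair to a disk node with a hash function
        known to every processor, so no central node is needed."""
        h = hashlib.sha1(f"{file_id}:{record_no}".encode()).digest()
        return int.from_bytes(h[:4], "big") % nnodes

    def node_for_access(file_id, record_no, nnodes, up):
        """If the home disk node is down, a shadow node takes over."""
        n = disknode(file_id, record_no, nnodes)
        return n if up[n] else (n + 1) % nnodes    # hypothetical shadow rule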

barve:bus:
Rakesh Barve, Elizabeth Shriver, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, and Jeffrey Scott Vitter. Modeling and optimizing I/O throughput of multiple disks on a bus (summary). In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 264-265. ACM Press, June 1998.
See also later version barve:bus2.

Keywords: disk model, I/O bus, device model, I/O model, pario-bib

barve:bus2:
Rakesh Barve, Jeffrey Vitter, Elizabeth Shriver, Phillip Gibbons, Bruce Hillyer, and Yossi Matias. Modeling and optimizing I/O throughput of multiple disks on a bus. In Proceedings of the 1999 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 83-92. ACM Press, June 1999.
See also earlier version barve:bus.

Keywords: disk model, I/O bus, device model, I/O model, pario-bib

barve:competitive2:
Rakesh Barve, Mahesh Kallahalla, Peter J. Varman, and Jeffrey Scott Vitter. Competitive parallel disk prefetching and buffer management. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 47-56, San Jose, CA, November 1997. ACM Press.

Abstract: We provide a competitive analysis framework for online prefetching and buffer management algorithms in parallel I/O systems, using a read-once model of block references. This has widespread applicability to key I/O-bound applications such as external merging and concurrent playback of multiple video streams. Two realistic lookahead models, global lookahead and local lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently in existing systems. An important aspect of our work is that we show how to implement both the models of lookahead in practice using the simple techniques of forecasting and flushing.

Given a D-disk parallel I/O system and a globally shared I/O buffer that can hold up to M disk blocks, we derive a lower bound of $\Omega(\sqrt{D})$ on the competitive ratio of any deterministic online prefetching algorithm with O(M) lookahead. NOM is shown to match the lower bound using global M-block lookahead. In contrast, using only local lookahead results in an $\Omega(D)$ competitive ratio. When the buffer is distributed into D portions of M/D blocks each, the algorithm GREED based on local lookahead is shown to be optimal, and NOM is within a constant factor of optimal. Thus we provide a theoretical basis for the intuition that global lookahead is more valuable for prefetching in the case of a shared buffer configuration whereas it is enough to provide local lookahead in case of the distributed configuration. Finally, we analyze the performance of these algorithms for reference strings generated by a uniformly-random stochastic process and we show that they achieve the minimal expected number of I/Os. These results also give bounds on the worst-case expected performance of algorithms which employ randomization in the data layout.

Keywords: disk prefetching, file caching, parallel I/O, pario-bib

Comment: See also barve:competitive. They propose two methods for scheduling prefetch operations in the situation where the access pattern is largely known in advance, in such a way as to minimize the total number of parallel I/Os. The two methods are quite straightforward, and yet match the optimum lower bound for an on-line algorithm.

barve:jmergesort:
Rakesh D. Barve, Edward F. Grove, and Jeffrey S. Vitter. Simple randomized mergesort on parallel disks. Parallel Computing, 23(4):601-631, June 1997.
See also earlier version barve:mergesort.

Abstract: We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the ``right'' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the ``right'' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs.

The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M, and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.

Keywords: parallel I/O algorithm, sorting, pario-bib

Comment: This paper was formerly called barve:mergesort; I discovered that the paper had appeared in SPAA96, so the SPAA96 paper is now called barve:mergesort.

barve:mergesort:
Rakesh D. Barve, Edward F. Grove, and Jeffrey S. Vitter. Simple randomized mergesort on parallel disks. In Proceedings of the Eighth Symposium on Parallel Algorithms and Architectures, pages 109-118, Padua, Italy, June 1996. ACM Press.
See also later version barve:jmergesort.

Keywords: parallel I/O algorithm, sorting, pario-bib

barve:round:
Rakesh Barve, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, Elizabeth Shriver, and Jeffrey Scott Vitter. Round-like behavior in multiple disks on a bus. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 1-9, Atlanta, GA, May 1999. ACM Press.

Abstract: In modern I/O architectures, multiple disk drives are attached to each I/O bus. Under I/O-intensive workloads, the disk latency for a request can be overlapped with the disk latency and data transfers of requests to other disks, potentially resulting in an aggregate I/O throughput at nearly bus bandwidth. This paper reports on a performance impairment that results from a previously unknown form of convoy behavior in disk I/O, which we call rounds. In rounds, independent requests to distinct disks convoy, so that each disk services one request before any disk services its next request. We analyze log files to describe read performance of multiple Seagate Wren-7 disks that share a SCSI bus under a heavy workload, demonstrating the rounds behavior and quantifying its performance impact.

Keywords: disk, I/O bus, parallel I/O, pario-bib

batcher:staran:
K. E. Batcher. STARAN parallel processor system hardware. AFIPS Conference Proceedings, pages 405-410, 1974.

Keywords: parallel architecture, array processor, parallel I/O, SIMD, pario-bib

Comment: This paper is reproduced in Kuhn and Padua's (1981, IEEE) survey ``Tutorial on Parallel Processing.'' The STARAN is an array processor that uses Multi-Dimensional-Access (MDA) memories and permutation networks to access data in bit slices in a variety of ways, with high-speed I/O capabilities. Its router (called the flip network) could permute data among the array processors, or between the array processors and external devices, including disks, video input, and displays.

baylor:methodology:
Sandra Johnson Baylor, Caroline Benveniste, and Leo J. Boelhouwer. A methodology for evaluating parallel I/O performance for massively parallel processors. In Proceedings of the 27th Annual Simulation Symposium, pages 31-40, April 1994.

Keywords: parallel I/O, parallel architecture, simulation, pario-bib

baylor:perfeval:
Sandra Johnson Baylor, Caroline B. Benveniste, and Yarsun Hsu. Performance evaluation of a parallel I/O architecture. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 404-413, Barcelona, July 1995. ACM Press.
See also earlier version baylor:perfeval-tr.

Keywords: performance evaluation, parallel architecture, parallel I/O, pario-bib

Comment: They use a simulator to evaluate the performance of a parallel I/O system. They simulate the network and disks under a synthetic workload, and measure the time it takes for I/O requests to traverse the network, be processed, and return. They also measure the impact of I/O requests on non-I/O messages. Their results are fairly unsurprising.

baylor:perfeval-tr:
Sandra Johnson Baylor, Caroline B. Benveniste, and Yarsun Hsu. Performance evaluation of a parallel I/O architecture. Technical Report RC 20049, IBM T. J. Watson Research Center, May 1995.
See also later version baylor:perfeval.

Keywords: performance evaluation, parallel architecture, parallel I/O, pario-bib

baylor:vulcan-perf:
Sandra Johnson Baylor, Caroline Benveniste, and Yarsun Hsu. Performance evaluation of a massively parallel I/O subsystem. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 1-15. IBM Watson Research Center, 1994. Also appeared in Computer Architecture News 22(4).
See also later version baylor:vulcan-perf-book.

Keywords: parallel I/O, parallel architecture, performance analysis, pario-bib

Comment: See polished version baylor:vulcan-perf-book. Simulation of the I/O architecture for the Vulcan MPP at IBM TJW. This is a distributed-memory MIMD system with a bidirectional omega-type interconnection network, and separate compute and I/O nodes. They use a stochastic workload to evaluate the average I/O performance under a few different situations, and then use that average performance, along with a stochastic workload, in a detailed simulation of the interconnection network. (What would be the effect of adding variance to the I/O-node performance?) A key point is that the I/O node will not accept any more requests until a current write request is finished being processed (copied into the write-back cache). If there are many writes, this can back up the network (would a different write-request protocol help?). It is not clear how the concurrency of reads is modeled. Results show that the network saturates for high request rates and small numbers of I/O nodes. As the request rate decreases or the number of I/O nodes increases, performance levels off to a reasonable value. Placement of I/O nodes didn't make much difference, nor did extra non-I/O traffic. Given their parameters, and for reasonable loads, 1 I/O node per 4 compute nodes was a reasonable balance, and was scalable.

baylor:vulcan-perf-book:
Sandra Johnson Baylor, Caroline Benveniste, and Yarsun Hsu. Performance evaluation of a massively parallel I/O subsystem. In Jain et al. [iopads-book], chapter 13, pages 293-311.
See also earlier version baylor:vulcan-perf.

Abstract: Presented are the trace-driven simulation results of a study conducted to evaluate the performance of the internal parallel I/O subsystem of the Vulcan massively parallel processor (MPP) architecture. The system sizes evaluated vary from 16 to 512 nodes. The results show that a compute node to I/O node ratio of four is the most cost effective for all system sizes, suggesting high scalability. Also, processor-to-processor communication effects are negligible for small message sizes and the greater the fraction of I/O reads, the better the I/O performance. Worst-case I/O node placement is within 13% of more efficient placement strategies. Introducing parallelism into the internal I/O subsystem improves I/O performance significantly.

Keywords: parallel I/O architecture, performance evaluation, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

baylor:workload:
Sandra Johnson Baylor and C. Eric Wu. Parallel I/O workload characteristics using Vesta. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 16-29, April 1995.
See also later version baylor:workload-book.

Abstract: In recent years, the design and performance evaluation of parallel processors has focused on the processor, memory and communication subsystems. As a result, these subsystems have better performance potential than the I/O subsystem. In fact, the I/O subsystem is the bottleneck in many machines. However, there are a number of studies currently underway to improve the design of parallel I/O subsystems. To develop optimal parallel I/O subsystem designs, one must have a thorough understanding of the workload characteristics of parallel I/O and its exploitation of the associated parallel file system. Presented are the results of a study conducted to analyze the parallel I/O workloads of several applications on a parallel processor using the Vesta parallel file system. Traces of the applications are obtained to collect system events, communication events, and parallel I/O events. The traces are then analyzed to determine workload characteristics. The results show I/O request rates on the order of hundreds of requests per second, a large majority of requests are for small amounts of data (less than 1500 bytes), a few requests are for large amounts of data (on the order of megabytes), significant file sharing among processes within a job, and strong temporal, traditional spatial, and interprocess spatial locality.

Keywords: parallel I/O, workload characterization, pario-bib

Comment: See polished version baylor:workload-book. They characterize four parallel applications: sort, matrix multiply, seismic migration, and video server, in terms of their I/O activity. They found results that are consistent with kotz:workload, in that they also found lots of small data requests, some large data requests, significant file sharing and interprocess locality. This study found less of the non-contiguous access than did kotz:workload, because of the logical views provided by Vesta. Note on-line postscript does not include figures.

baylor:workload-book:
Sandra Johnson Baylor and C. Eric Wu. Parallel I/O workload characteristics using Vesta. In Jain et al. [iopads-book], chapter 7, pages 167-185.
See also earlier version baylor:workload.

Abstract: To develop optimal parallel I/O subsystems, one must have a thorough understanding of the workload characteristics of parallel I/O and its exploitation of the associated parallel file system. Presented are the results of a study conducted to analyze the parallel I/O workloads of several applications on a parallel processor using the Vesta parallel file system. Traces of the applications are obtained to collect system events, communication events, and parallel I/O events. The traces are then analyzed to determine workload characteristics. The results show I/O request rates on the order of hundreds of requests per second, a large majority of requests are for small amounts of data (less than 1500 bytes), a few requests are for large amounts of data (on the order of megabytes), significant file sharing among processes within a job, and strong temporal, traditional spatial, and interprocess spatial locality.

Keywords: parallel I/O, file access pattern, workload characterization, file system workload, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

bbn:admin:
BBN Advanced Computers Inc. TC2000 System Administration Guide, revision 3.0 edition, April 1991.

Keywords: BBN, parallel I/O, pario-bib

Comment: Administrative manual for the TC2000 I/O system. Can stripe over partitions in a user-specified set of disks. Large requests automatically split and done in parallel. See also garber:tc2000.
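
Code sketch: the splitting of large requests is mechanical under round-robin striping (a generic sketch, not BBN's actual layout; the stripe unit is an assumption). Each resulting piece can then be issued to its disk in parallel.

    def split_request(offset, length, stripe_unit, ndisks):
        """Decompose one large file request into per-disk
        (disk, disk_offset, length) pieces for a round-robin stripe."""
        pieces = []
        while length > 0:
            stripe, within = divmod(offset, stripe_unit)
            disk = stripe % ndisks
            disk_off = (stripe // ndisks) * stripe_unit + within
            n = min(stripe_unit - within, length)   # stay inside one stripe unit
            pieces.append((disk, disk_off, n))
            offset += n
            length -= n
        return pieces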

becher:ooc-solver:
Jonathan D. Becher and John F. Porter. Out of core dense solvers for the MasPar parallel computer. Technical Report MP/IP/SP-37.94, MasPar Computer Corporation, 1994.

Keywords: parallel I/O, scientific computing, linear algebra, pario-bib

Comment: They look at out-of-core block and slab solvers for the MasPar. They overlap reading one block with the computation of the previous block. They solve matrices up to 40k x 40k, and obtain 3.14 GFlops even with I/O considered.

bell:physics:
Jean L. Bell. A specialized data management system for parallel execution of particle physics codes. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 277-285, Chicago, IL, 1988. ACM Press.

Keywords: file access pattern, disk prefetch, file system, pario-bib

Comment: A specialized database system for particle physics codes. Valuable for its description of access patterns and subsequent file access requirements. Particle-in-cell codes iterate over timesteps, updating the position of each particle, and then the characteristics of each cell in the grid. Particles may move from cell to cell. Updating a particle requires the particle's own data and that of nearby grid cells. The whole dataset is too big for memory, and each timestep must be stored on disk for later analysis anyway. Regular file systems are inadequate: a specialized DBMS is more appropriate. Characteristics needed by their application class: multidimensional access (by particle type or by location, i.e., multiple views of the data), coordination between grid and particle data, coordination between processors, coordinated access to meta-data, inverted files, horizontal clustering, large blocking of data, asynchronous I/O, array data, complicated joins, and prefetching according to a user-prespecified order. Note that many of these things can be provided by a file system, but most are hard to come by in typical file systems, if not impossible. Many of these features are generalizable to other applications.

benner:pargraphics:
Robert E. Benner. Parallel graphics algorithms on a 1024-processor hypercube. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 133-140, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: hypercube, graphics, parallel algorithm, parallel I/O, pario-bib

Comment: About using the nCUBE/10's RT Graphics System. They were frustrated by an unusual mapping from the graphics memory to the display, a shortage of memory on the graphics nodes, and small message buffers on the graphics nodes. They wrote some algorithms for collecting the columns of pixels from the hypercube nodes, and routing them to the appropriate graphics node. They also would have liked a better interconnection network between the graphics nodes, at least for synchronization.

bennett:jovian:
Robert Bennett, Kelvin Bryant, Alan Sussman, Raja Das, and Joel Saltz. Jovian: A framework for optimizing parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, pages 10-20, Mississippi State, MS, October 1994. IEEE Computer Society Press.

Keywords: parallel I/O, pario-bib

Comment: Jovian is a runtime library for use with SPMD codes, e.g., HPF. They restrict I/O to collective operations, and provide extra processes to ``coalesce'' the many requests from multiple compute processes into fewer, larger requests to the operating system, perhaps optimized for access order. They mention that there is a standardization process underway for specifying data distributions, and describe a compact representation for strided access to n-dimensional data structures. Coalescing basically means combining requests to eliminate duplication and to merge adjacent requests (sketched below). Requests are sent to the coalescers in full blocks, to lower the processing overhead. Nonetheless, their method moves requests around twice and entails several memory-to-memory copies of the data, so their overhead is high.
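
Code sketch: coalescing, at its core, is interval merging (a minimal rendition; Jovian's actual data structures and block handling differ).

    def coalesce(requests):
        """Merge overlapping or adjacent (offset, length) requests into
        the fewest larger requests, eliminating duplication."""
        merged = []
        for off, ln in sorted(requests):
            if merged and off <= merged[-1][0] + merged[-1][1]:
                last_off, last_ln = merged[-1]
                merged[-1] = (last_off, max(last_ln, off + ln - last_off))
            else:
                merged.append((off, ln))
        return merged

For example, coalesce([(0, 10), (5, 10), (20, 5), (25, 5)]) yields [(0, 15), (20, 10)].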

berdahl:transport:
Lawrence Berdahl. Parallel transport protocol proposal. Lawrence Livermore National Labs, January 3, 1995. Draft.
See also earlier version berdahl:woodenman.

Keywords: parallel I/O, network, supercomputer system, pario-bib

Comment: An update of berdahl:woodenman, close to the final draft.

berdahl:woodenman:
Lawrence Berdahl. Parallel data exchange. Lawrence Livermore National Labs, January 28, 1994. WoodenMan Proposal.
See also later version berdahl:transport.

Keywords: parallel I/O, network, supercomputer system, pario-bib

Comment: They describe a protocol for making parallel transfers of arbitrary data sets from one set of data servers to another. The goal is to be independent of specific architectures or even types of data servers, and to work on top of existing transport protocols. The data set is described using a gather set for the source and a scatter set for the destination, with a linear address space as an intermediate representation (see the sketch below). All the servers are contacted, figure out with whom they need to talk, and exchange port information. Each pair exchanges votes on who will control the transfer (i.e., who will control the order of the transfer) and on their maximum data rates; this information is used to settle on the controller and the set of ports to be used. This proposal is not final and is under active development, so it may change.
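
Code sketch: my own rendition of pairing the gather and scatter sets through the linear intermediate representation; names are hypothetical. Walking both extent lists together yields one transfer per overlap.

    def plan_transfers(gather, scatter):
        """gather/scatter: lists of (server, offset, length) extents that
        cover the linear address space in order. Returns the pairwise
        (src_server, src_off, dst_server, dst_off, length) transfers."""
        transfers, gi, si, g_used, s_used = [], 0, 0, 0, 0
        while gi < len(gather) and si < len(scatter):
            gsrv, goff, glen = gather[gi]
            ssrv, soff, slen = scatter[si]
            n = min(glen - g_used, slen - s_used)   # size of the overlap
            transfers.append((gsrv, goff + g_used, ssrv, soff + s_used, n))
            g_used += n
            s_used += n
            if g_used == glen:
                gi, g_used = gi + 1, 0
            if s_used == slen:
                si, s_used = si + 1, 0
        return transfers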

berrendorf:paragon:
R. Berrendorf, H. Burg, and U. Detert. Performance characteristics of parallel computers: Intel Paragon case study. IT+TI Informationstechnik und Technische Informatik, 37(2):37-45, April 1995. (In German).

Keywords: parallel computing, performance evaluation, parallel file system, pario-bib

Comment: In German. They summarize typical performance of the Intel Paragon, including the communication performance and the parallel file-system performance.

berry:nasa:
Michael R. Berry and Tarek A. El-Ghazawi. Parallel input/output characteristics of NASA science applications. In Proceedings of the Tenth International Parallel Processing Symposium, pages 741-747, Honolulu, April 1996. IEEE Computer Society Press.

Keywords: scientific computation, application, parallel I/O, pario-bib

bershad:sio-os:
Brian Bershad, David Black, David DeWitt, Garth Gibson, Kai Li, Larry Peterson, and Marc Snir. Operating system support for high-performance parallel I/O systems. Technical Report CCSF-40, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Four major components: networking, memory servers, file system, and persistent object store. The networking part focuses on low-latency communication within an application, between applications, and between machines (Bershad and Peterson). Memory servers, shared virtual memory, and checkpointing support (Kai Li). File-system support includes benchmarking, transparent informed prefetching (Gibson), a common interface for PFS and Vesta (Snir), and integrating secondary and tertiary storage systems (including the integration of the National Storage Lab's HPSS (see coyne:hpss) into this project in 1995). OSF/1 (Black) will be extended to support parallel file systems, extent-like behavior, and block coalescing. The persistent object store (DeWitt) is a radical change to an object-oriented interface, with transparent I/O (though extensible and changeable with subclassing, presumably) and heterogeneous support via the Object Definition Language standard. Persistent objects may be integrated with the memory servers and shared virtual memory. See also poole:sio-survey, bagrodia:sio-character, choudhary:sio-language.

berson:multimedia:
Steven Berson, Leana Golubchik, and Richard R. Muntz. Fault tolerant design of multimedia servers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 364-375. ACM Press, 1995.

Keywords: fault tolerance, multimedia, video on demand, parallel I/O, pario-bib

best:cmmdio:
Michael L. Best, Adam Greenberg, Craig Stanfill, and Lewis W. Tucker. CMMD I/O: A parallel Unix I/O. In Proceedings of the Seventh International Parallel Processing Symposium, pages 489-495, Newport Beach, CA, 1993. IEEE Computer Society Press.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Much like Intel CFS, with different I/O modes that determine when the compute nodes synchronize, and the semantics of I/Os written to the file. They found it hard to get good bandwidth for independent I/Os, as opposed to coordinated I/Os; part of this was due to their RAID 3 disk array, but it is more complicated than that. Some performance numbers were given in the talk.

bestavros:raid:
Azer Bestavros. IDA-based redundant arrays of inexpensive disks. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 2-9, December 1991.

Keywords: RAID, disk array, reliability, parallel I/O, pario-bib

Comment: Uses the Information Dispersal Algorithm (IDA) to generate $n+m$ blocks from $n$ blocks, tolerating up to $m$ disk failures; the data of the original $n$ blocks is dispersed among the $n+m$ blocks, any $n$ of which suffice for reconstruction (see the sketch below). Not connected with the RAID project.
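
Code sketch: a worked miniature of IDA-style dispersal (my own illustration; the prime field and Vandermonde matrix are arbitrary choices, not Bestavros's parameters). Any $n$ of the $n+m$ encoded pieces can be solved for the original $n$ blocks.

    P = 257  # toy prime field; production codes use GF(2^8) arithmetic

    def encode(data, m):
        """data: n block values (ints mod P). Returns n+m encoded values;
        piece i combines the data with Vandermonde weights (i+1)^j."""
        n = len(data)
        return [sum(pow(i + 1, j, P) * data[j] for j in range(n)) % P
                for i in range(n + m)]

    def decode(pieces, n):
        """pieces: any n surviving (index, value) pairs. Solves the n x n
        Vandermonde system mod P by Gaussian elimination."""
        a = [[pow(i + 1, j, P) for j in range(n)] + [v] for i, v in pieces]
        for c in range(n):
            r = next(r for r in range(c, n) if a[r][c])  # find a pivot
            a[c], a[r] = a[r], a[c]
            inv = pow(a[c][c], P - 2, P)                 # modular inverse
            a[c] = [x * inv % P for x in a[c]]
            for r in range(n):
                if r != c and a[r][c]:
                    f = a[r][c]
                    a[r] = [(x - f * y) % P for x, y in zip(a[r], a[c])]
        return [row[n] for row in a]

For example, encode([3, 5], 1) gives [8, 13, 18]; losing any one piece, say decode([(1, 13), (2, 18)], 2) using only the last two, still recovers [3, 5].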

bester:gass:
Joseph Bester, Ian Foster, Carl Kesselman, Jean Tedesco, and Steven Tuecke. GASS: A data movement and access service for wide area computing systems. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 78-88, Atlanta, GA, May 1999. ACM Press.

Abstract: In wide area computing, programs frequently execute at sites that are distant from their data. Data access mechanisms are required that place limited functionality demands on an application or host system yet permit high-performance implementations. To address these requirements, we propose a data movement and access service called Global Access to Secondary Storage (GASS). This service defines a global name space via Uniform Resource Locators and allows applications to access remote files via standard I/O interfaces. High performance is achieved by incorporating default data movement strategies that are specialized for I/O patterns common in wide area applications and by providing support for programmer management of data movement. GASS forms part of the Globus toolkit, a set of services for high-performance distributed computing. GASS itself makes use of Globus services for security and communication, and other Globus components use GASS services for executable staging and real-time remote monitoring. Application experiences demonstrate that the library has practical utility.

Keywords: wide-area network, parallel I/O, pario-bib

beynon:datacutter:
Michael D. Beynon, Renato Ferreira, Tahsin Kurc, Alan Sussman, and Joel Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In Proceedings of the 2000 Mass Storage Systems Conference, pages 119-133, College Park, MD, March 2000. IEEE Computer Society Press.

Keywords: data grid, filter, pario-bib

bitton:schedule:
Dina Bitton. Arm scheduling in shadowed disks. In Proceedings of IEEE Compcon, pages 132-136, Spring 1989.

Keywords: parallel I/O, disk shadowing, reliability, disk mirroring, disk optimization, pario-bib

Comment: Goes further than bitton:shadow. Uses simulation to verify results from that paper, which were expressions for the expected seek distance of shadowed disks under shortest-seek-time arm scheduling. The problem is her assumption that the arm positions stay independent, in the face of correlating effects like writes, which move all arms to the same place. The simulations match the model only barely, and only in some cases. In any case, shadowed disks can improve performance for workloads with more than 60-70% reads.

bitton:shadow:
D. Bitton and J. Gray. Disk shadowing. In Proceedings of the 14th International Conference on Very Large Data Bases, pages 331-338, 1988.

Keywords: parallel I/O, disk shadowing, reliability, disk mirroring, disk optimization, pario-bib

Comment: Also TR UIC EECS 88-1 from Univ of Illinois at Chicago. Shadowed disks are mirroring with more than 2 disks. Writes to all disks, reads from one with shortest seek time. Acknowledges but ignores problem posed by lo:disks. Also considers that newer disk technology does not have linear seek time $(a+bx)$ but rather $(a+b\sqrt{x})$. Shows that with either seek distribution the average seek time for workloads with at least 60% reads decreases in the number of disks. See also bitton:schedule.
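
Code sketch: the shadowing claim is easy to check numerically (my own Monte Carlo, under the same independent-arm-position assumption that bitton:schedule questions). With cylinders normalized to [0,1], the expected shortest seek distance shrinks as the number of shadowed disks grows; seek time then follows from a model such as $(a+b\sqrt{x})$.

    import random

    def expected_min_seek(k, trials=100_000):
        """Expected seek distance when a read can be served by whichever
        of k independently positioned shadowed arms is closest."""
        total = 0.0
        for _ in range(trials):
            target = random.random()
            total += min(abs(random.random() - target) for _ in range(k))
        return total / trials

For k = 1 this converges to the classic 1/3; larger k gives steadily smaller values.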

bjorstad:structure:
P. E. Bjørstad and J. Cook. Large scale structural analysis on massively parallel computers. In Linear Algebra for Large Scale and Real-Time Applications, pages 3-11. Kluwer Academic Publishers, 1993. ftp from ftp.ii.uib.no in pub/tech_reports/mpp_sestra.ps.Z.

Keywords: parallel I/O, file access pattern, pario-bib

Comment: A substantial part of this structural-analysis application was involved in I/O, moving substructures in and out of RAM. The MasPar IO-RAM helped a lot, nearly halving the time required. On the Cray, the SSD had an even bigger impact, perhaps 7-12 times faster. Their main conclusion is that caching helped. Most likely this was due to its double-buffering, since they structured the code to read/compute/write in large ``superblocks''.

blaum:evenodd:
Mario Blaum, Jim Brady, Jehoshua Bruck, Jai Menon, and Alexander Vardy. The EVENODD code and its generalization: An efficient scheme for tolerating multiple disk failures in RAID architectures. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 14, pages 187-208. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: disk array, RAID, parallel I/O, pario-bib

Comment: Part of jin:io-book.

bonachea:java-io:
Dan Bonachea, Phillip Dickens, and Rajeev Thakur. High-performance file I/O in Java: Existing approaches and bulk I/O extensions. Concurrency and Computation: Practice and Experience, 13(8-9):713-736, 2001.
See also earlier version bonachea:java-io-tr.

Abstract: There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java (many of which are not obvious at first glance) and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.

Keywords: parallel I/O, Java, file system interface, pario-bib

bonachea:java-io-tr:
Dan Bonachea, Phillip Dickens, and Rajeev Thakur. High-performance file I/O in Java: Existing approaches and bulk I/O extensions. Technical Report ANL/MCS-P840-0800, Mathematics and Computer Science Division, Argonne National Laboratory, August 2000.
See also later version bonachea:java-io.

Abstract: There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java (many of which are not obvious at first glance) and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.

Keywords: parallel I/O, Java, file system interface, pario-bib

boral:bubba:
Haran Boral, William Alexander, Larry Clay, George Copeland, Scott Danforth, Michael Franklin, Brian Hart, Marc Smith, and Patrick Valduriez. Prototyping Bubba, a highly parallel database system. IEEE Transactions on Knowledge and Data Engineering, 2(1), March 1990.

Keywords: parallel I/O, database, disk caching, pario-bib

Comment: More recent than copeland:bubba, and a little more general. This gives few details, and doesn't spend much time on the parallel I/O. Bubba does use parallel independent disks, with a significant effort to place data on the disks, and do the work local to the disks, to balance the load and minimize interprocessor communication. Also they use a single-level store (i.e., memory-mapped files) to improve performance of their I/O system, including page locking that is assisted by the MMU. The OS has hooks for the database manager to give memory-management policy hints.

boral:critique:
H. Boral and D. DeWitt. Database machines: an idea whose time has passed? In Proceedings of the Second International Workshop on Database Machines, pages 166-187. Springer-Verlag, 1983.

Keywords: file access pattern, parallel I/O, database machine, pario-bib

Comment: Improvements in I/O bandwidth are crucial for supporting database machines; otherwise, highly parallel DB machines are useless (I/O bound). Two ways to do it: 1) synchronized interleaving, using a custom controller and regular disks to read/write the same track on all disks, which speeds individual accesses; 2) a very large cache (100-200 MB) to keep blocks for re-use and to do prefetching. But see dewitt:pardbs.

bordawekar:collective:
Rajesh Bordawekar. Implementation of collective I/O in the Intel Paragon parallel file system: Initial experiences. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 20-27. ACM Press, July 1997.
See also earlier version bordawekar:collective-tr.

Keywords: collective I/O, multiprocessor file system, parallel I/O, pario-bib

Comment: The tech report originally keyed bordawekar:collective was renamed bordawekar:collective-tr, so this ICS paper could take the key bordawekar:collective.

bordawekar:collective-tr:
Rajesh Bordawekar. Implementation and evaluation of collective I/O in the Intel Paragon Parallel File System. Technical Report CACR TR-128, Center for Advanced Computing Research, California Institute of Technology, November 1996.
See also later version bordawekar:collective.

Abstract: A majority of parallel applications obtain parallelism by partitioning data over multiple processors. Accessing distributed data structures like arrays from files often requires each processor to make a large number of small non-contiguous data requests. This problem can be addressed by replacing small non-contiguous requests by large collective requests. This approach, known as Collective I/O, has been found to work extremely well in practice. In this paper, we describe implementation and evaluation of a collective I/O prototype in a production parallel file system on the Intel Paragon. The prototype is implemented in the PFS subsystem of the Intel Paragon Operating System. We evaluate the collective I/O performance by comparing it with the PFS M_RECORD and M_UNIX I/O modes. It is observed that collective I/O provides significant performance improvement over accesses in M_UNIX mode. However, in many cases, various implementation overheads cause collective I/O to provide lower performance than the M_RECORD I/O mode.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: This tech report was called bordawekar:collective, then renamed bordawekar:collective-tr when the ICS paper bordawekar:collective appeared.
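
Code sketch: a generic two-phase rendition of the collective idea (not the PFS implementation; the single-process stand-in for the message-passing exchange is an assumption). A few large contiguous reads replace the many small non-contiguous requests, and a redistribution phase delivers the pieces.

    def collective_read(f, requests, nprocs):
        """requests: one list of (offset, length) per process. Phase 1
        reads the enclosing extent in large contiguous chunks; phase 2
        scatters the bytes to the requesting processes."""
        lo = min(off for reqs in requests for off, _ in reqs)
        hi = max(off + ln for reqs in requests for off, ln in reqs)
        chunk = (hi - lo + nprocs - 1) // nprocs
        buf = bytearray(hi - lo)
        for p in range(nprocs):                 # phase 1: large reads
            n = min(chunk, hi - lo - p * chunk)
            if n <= 0:
                break
            f.seek(lo + p * chunk)
            buf[p * chunk:p * chunk + n] = f.read(n)
        # Phase 2: redistribute (a local scatter standing in for messages).
        return [[bytes(buf[off - lo:off - lo + ln]) for off, ln in reqs]
                for reqs in requests]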

bordawekar:comm:
Rajesh Bordawekar and Alok Choudhary. Communication strategies for out-of-core programs on distributed memory machines. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 395-403, Barcelona, July 1995. ACM Press.
See also earlier version bordawekar:comm-tr.

Keywords: parallel I/O, inter-processor communication, pario-bib

Comment: bordawekar:comm-tr is nearly identical in content. Also bordawekar:commstrat is a shorter version.

bordawekar:comm-tr:
Rajesh Bordawekar and Alok Choudhary. Communication strategies for out-of-core programs on distributed memory machines. Technical Report SCCS-667, NPAC, Syracuse University, 1994.
See also later version bordawekar:comm.

Abstract: In this paper, we show that communication in the out-of-core distributed memory problems requires both inter-processor communication and file I/O. Given that primary data structures reside in files, even communication requires I/O. Thus, it is important to optimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method, termed as the "out-of-core" communication method, follows a loosely synchronous model. Computation and Communication phases in this case are clearly separated, and communication requires permutation of data in files. The second method, termed as "demand-driven-in-core communication" considers only communication required of each in-core data slab individually. The third method, termed as "producer-driven-in-core communication" goes even one step further and tries to identify the potential (future) use of data while it is in memory. We describe these methods in detail and provide performance results for out-of-core applications; namely, two-dimensional FFT and two-dimensional elliptic solver. Finally, we discuss how "out-of-core" and "in-core" communication methods could be used in virtual memory environments on distributed memory machines.

Keywords: parallel I/O, inter-processor communication, pario-bib

Comment: They compare different ways to do global communications in out-of-core applications, involving file I/O and communication at different times. They also comment briefly on how it would work if it depended on virtual memory at each node.

bordawekar:commstrat:
Rajesh Bordawekar and Alok Choudhary. Communication strategies for out-of-core programs on distributed memory machines. In Proceedings of the 1995 International Conference on High Performance Computing, pages 130-135, New Delhi, India, December 1995.
See also earlier version bordawekar:comm.

Keywords: interprocessor communication, parallel I/O, pario-bib

Comment: Shorter version of bordawekar:comm.

bordawekar:compcomm:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. Compilation and communication strategies for out-of-core programs on distributed-memory machines. Journal of Parallel and Distributed Computing, 38(2):277-288, November 1996.
See also earlier version bordawekar:compcomm-tr.

Abstract: It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both inter-processor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method called the generalized collective communication method follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method called the receiver-driven in-core communication considers only communication required of each in-core data slab individually. The third method called the owner-driven in-core communication goes even one step further and tries to identify the potential future use of data (by the recipients) while it is in the sender's memory. We describe these methods in detail and present a simple heuristic to choose a communication method from among the three methods. We then provide performance results for two out-of-core applications, the two-dimensional FFT code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how the out-of-core and in-core communication methods can be used in virtual memory environments on distributed memory machines.

Keywords: compiler, communication, out-of-core, parallel I/O, inter-processor communication, pario-bib

bordawekar:compcomm-tr:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. Compilation and communication strategies for out-of-core programs on distributed memory machines. Technical Report CACR-113, Scalable I/O Initiative, Center for Advanced Computing Research, California Institute of Technology, November 1995.
See also later version bordawekar:compcomm.

Abstract: It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both inter-processor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method called the generalized collective communication method follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method called the receiver-driven in-core communication considers only communication required of each in-core data slab individually. The third method called the owner-driven in-core communication goes even one step further and tries to identify the potential future use of data (by the recipients) while it is in the sender's memory. We describe these methods in detail and present a simple heuristic to choose a communication method from among the three methods. We then provide performance results for two out-of-core applications, the two-dimensional FFT code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how the out-of-core and in-core communication methods can be used in virtual memory environments on distributed memory machines.

Keywords: out-of-core, compiler, communication, distributed memory, parallel I/O, pario-bib

Comment: See also bordawekar:comm, at ICS'95.

bordawekar:compiling:
Rajesh Bordawekar and Alok Choudhary. Issues in compiling I/O intensive problems. In Jain et al. [iopads-book], chapter 3, pages 69-96.

Abstract: None.

Keywords: parallel I/O, compiler, out-of-core, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

bordawekar:compositional:
Rajesh Bordawekar. A case for compositional file systems (extended abstract). Technical Report CACR TR-161, Center for Advanced Computing Research, California Institute of Technology, March 1998.

Abstract: This article presents a case for compositional file systems (CFSs). The CFS is designed using the end-to-end argument; the basic file system attributes, therefore, are independent of the user requirements. The CFS is designed as a functionally compositional, structurally distributed, and dynamically extendable file system. The article also discusses the advantages and implementation alternatives for these file systems, and outlines possible applications.

Keywords: parallel I/O, multiprocessor file system, pario-bib

bordawekar:delta-fs:
Rajesh Bordawekar, Alok Choudhary, and Juan Miguel Del Rosario. An experimental performance evaluation of Touchstone Delta Concurrent File System. In Proceedings of the 7th ACM International Conference on Supercomputing, pages 367-376. ACM Press, 1993.
See also earlier version bordawekar:delta-fs-TR.

Abstract: For a high-performance parallel machine to be a scalable system, it must also have a scalable parallel I/O system. Recently, several commercial machines (e.g. Intel Touchstone Delta, Paragon, CM-5, Ncube-2) have been built that provide features for parallel I/O. However, very little is understood about the performance of these I/O systems. This paper presents an experimental evaluation of the Intel Touchstone Delta's Concurrent File System (CFS). The CFS utilizes the declustering of large files across the disks to improve the I/O performance. Data files can be read or written on the CFS using 4 access modes.

We present performance measurements for the CFS on the Touchstone Delta with 512 compute nodes and 32 I/O nodes. The study focuses on file read/write rates for various configurations of I/O and compute nodes. The study attempts to show the effect of access modes, buffer sizes and volume restrictions on the system performance. The paper also shows that the performance of the CFS can greatly vary for various data distributions commonly employed in scientific and engineering applications.

Keywords: performance evaluation, multiprocessor file system, parallel I/O, pario-bib

Comment: Some new numbers over bordawekar:delta-fs-TR, but basically the same conclusions.

bordawekar:delta-fs-TR:
Rajesh Bordawekar, Alok Choudhary, and Juan Miguel Del Rosario. An experimental performance evaluation of Touchstone Delta Concurrent File System. Technical Report SCCS-420, NPAC, Syracuse University, 1992.
See also later version bordawekar:delta-fs.

Keywords: performance evaluation, multiprocessor file system, parallel I/O, pario-bib

Comment: Evaluating the Caltech Touchstone Delta (512 nodes, 32 I/O nodes, 64 disks, 8 MB cache per I/O node). Basic measurements of different access patterns and I/O modes. Location in the network doesn't seem to matter. Throughput is often limited by the software; at least, the full hardware throughput is rarely obtained. Sometimes they are compute-node-limited, and other times they may be limited by the cache management. There must be a way to push the bottleneck back to the disks.

bordawekar:efficient:
Rajesh Bordawekar, Rajeev Thakur, and Alok Choudhary. Efficient compilation of out-of-core data parallel programs. Technical Report SCCS-622, NPAC, April 1994.
See also later version bordawekar:reorganize.

Abstract: Large scale scientific applications, such as the Grand Challenge applications, deal with very large quantities of data. The amount of main memory in distributed memory machines is usually not large enough to solve problems of realistic size. This limitation results in the need for system and application software support to provide efficient parallel I/O for out-of-core programs. This paper describes techniques for translating out-of-core programs written in a data parallel language like HPF to message passing node programs with explicit parallel I/O. We describe the basic compilation model and various steps involved in the compilation. The compilation process is explained with the help of an out-of-core matrix multiplication program. We first discuss how an out-of-core program can be translated by extending the method used for translating in-core programs. We then describe how the compiler can optimize the code by estimating the I/O costs associated with different array access patterns and selecting the method with the least I/O cost. This optimization can reduce the amount of I/O by as much as an order of magnitude. Performance results on the Intel Touchstone Delta are presented and analyzed.

Keywords: parallel I/O, compiler, pario-bib

Comment: Revised as bordawekar:reorganize. This is actually fairly different from thakur:runtime. They describe the same basic compiler technique, where arrays are distributed across processors, and each processor has a local array file holding the data of its local partitions. The I/O needed for a loop is broken into slabs, and the program proceeds as an alternation of (read slabs, compute, write slabs). The big new thing here is that the compiler tries different ways to form slabs (e.g., by row or by column), estimates the number of I/Os and the amount of data moved for each case, and chooses the case with the least I/O (see the sketch below). They also mention how the amount of memory allocated to different arrays affects the amount of I/O, but give no algorithm other than ``try all the possibilities''.
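
Code sketch: a toy version of the compiler's cost comparison (the cost model, the column-major local file layout, and all constants are my assumptions). The compiler would evaluate such a cost for row slabs and for column slabs and emit read/compute/write code for the cheaper shape.

    def io_cost(rows, cols, slab_rows, slab_cols, startup, per_byte, elem=8):
        """Estimate the cost of reading a rows x cols out-of-core array
        in slab_rows x slab_cols slabs from a column-major local array
        file (slab dimensions assumed to divide the array dimensions).
        A full-height slab is one contiguous request; otherwise each
        column fragment of a slab is a separate request."""
        nslabs = (rows // slab_rows) * (cols // slab_cols)
        reqs_per_slab = 1 if slab_rows == rows else slab_cols
        nreqs = nslabs * reqs_per_slab
        return nreqs * startup + rows * cols * elem * per_byte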

bordawekar:exemplar:
Rajesh Bordawekar, Steven Landherr, Don Capps, and Mark Davis. Experimental evaluation of the Hewlett-Packard Exemplar file system. ACM SIGMETRICS Performance Evaluation Review, 25(3):21-28, December 1997.
See also earlier version bordawekar:exemplar-tr2.
See also later version bordawekar:jexemplar.

Keywords: multiprocessor file system, performance evaluation, parallel I/O, pario-bib

Comment: Part of a special issue on parallel and distributed I/O.

bordawekar:exemplar-tr2:
Rajesh Bordawekar. Quantitative characterization and analysis of the I/O behavior of a commercial distributed-shared-memory machine. Technical Report CACR 157, Center for Advanced Computing Research, California Institute of Technology, March 1998.
See also later version bordawekar:exemplar.

Abstract: This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: (1) To evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks and (2) To provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become non-uniform and the I/O behavior of the entire system, in terms of the scalability and balance, degrades. We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be effectively used to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation and there is a need for mending the traditional memory-oriented design approach to address this problem.

Keywords: parallel I/O, pario-bib, workload characterization, distributed shared memory

bordawekar:framework:
Rajesh Bordawekar and Alok Choudhary. A framework for representing data parallel programs and its application in program reordering. Technical Report SCCS-698, NPAC, Syracuse University, March 1995.

Keywords: data parallel, parallel I/O, pario-bib

Comment: Although this is mostly a compilers paper, there is a little bit about parallel I/O here. They comment briefly on how their compiler framework will help them make a compiler that can provide advice to the file system about prefetching and cache replacement, and to decide on the layout of scratch files to optimize locality.

bordawekar:hpf:
Rajesh Bordawekar and Alok Choudhary. HPF with parallel I/O extensions. Technical Report SCCS-613, NPAC, Syracuse University, 1993.

Keywords: parallel I/O, pario-bib

Comment: They propose some extensions to HPF to accommodate parallel I/O.

bordawekar:hpfio:
Rajesh Bordawekar and Alok Choudhary. Extending I/O capabilities of High Performance Fortran: Initial experiences. Technical Report CACR-115, Scalable I/O Initiative, Center for Advanced Computing Research, California Institute of Technology, December 1995.

Abstract: This report presents implementation details of the prototype PASSION compiler. The PASSION compiler provides support for: (1) Accessing multidimensional in-core arrays and (2) Out-of-core computations. The PASSION compiler takes as input an annotated I/O intensive (either an out-of-core program or program accessing distributed arrays from files) High Performance Fortran (HPF) program. Using hints provided by the user, the compiler modifies the computation so as to minimize the I/O cost and restructures the program to incorporate explicit I/O calls. In this report, compilation of out-of-core FORALL constructs is illustrated using representative programs. Compiler support for accessing distributed in-core data is explained using illustrative examples and supplemented by experimental results.

Keywords: parallel I/O, compiler, FORTRAN, HPF, pario-bib

Comment: Currently not available on WWW. Describes implementation details of the PASSION Compiler.

bordawekar:jexemplar:
Rajesh Bordawekar. Quantitative characterization and analysis of the I/O behavior of a commercial distributed-shared-memory machine. IEEE Transactions on Parallel and Distributed Systems, 11(5):509-526, May 2000.
See also earlier version bordawekar:exemplar.

Abstract: This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: 1) To evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks, and 2) To provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become nonuniform and the I/O behavior of the entire system, in terms of both scalability and balance, degrades.

We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be used effectively to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation, and there is a need for mending the traditional memory-oriented design approach to address this problem.

Keywords: parallel I/O, pario-bib, workload characterization, distributed shared memory

bordawekar:model:
Rajesh Bordawekar, Alok Choudhary, Ken Kennedy, Charles Koelbel, and Michael Paleczny. A model and compilation strategy for out-of-core data parallel programs. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1-10, Santa Barbara, CA, July 1995. ACM Press. Also available as the following technical reports: NPAC Technical Report SCCS-0696, CRPC Technical Report CRPC-TR94507-S, SIO Technical Report CACR SIO-104.
See also earlier version bordawekar:model-tr.

Abstract: It is widely acknowledged in high-performance computing circles that parallel input/output needs substantial improvement in order to make scalable computers truly usable. We present a data storage model that allows processors independent access to their own data and a corresponding compilation strategy that integrates data-parallel computation with data distribution for out-of-core problems. Our results compare several communication methods and I/O optimizations using two out-of-core problems, Jacobi iteration and LU factorization.

Keywords: parallel I/O, compiler, pario-bib

bordawekar:model-tr:
Rajesh Bordawekar, Alok Choudhary, Ken Kennedy, Charles Koelbel, and Michael Paleczny. A model and compilation strategy for out-of-core data parallel programs. Technical Report CRPC-TR94507-S, CRPC, December 1994.
See also later version bordawekar:model.

Keywords: compilers, parallel I/O, out-of-core applications, pario-bib

Comment: Basically a summary of their I/O and compilation model for out-of-core compilation of HPF programs. See also paleczny:support.

bordawekar:msthesis:
Rajesh R. Bordawekar. Issues in software support for parallel I/O. Master's thesis, Syracuse University, May 1993.

Abstract: This thesis looks at various issues in providing application-level software support for parallel I/O. We show that the performance of the parallel I/O system varies greatly as a function of data distributions. We present runtime I/O primitives for parallel languages which allow the user to obtain a consistent performance over a wide range of data distributions.

In order to design these primitives, we study various parameters used in the design of a parallel file system. We evaluate the performance of the Touchstone Delta Concurrent File System and study the effect of parameters such as the number of processors, the number of disks, and the file size on system performance. We compute the I/O costs for common data distributions. We propose an alternative strategy - the two-phase data access strategy - to optimize the I/O costs associated with data distributions. We implement runtime primitives using the two-phase access strategy and show that these primitives not only improve I/O access rates but also let the user obtain complex data distributions such as block-block and block-cyclic.

Keywords: parallel I/O, pario-bib

Comment: This is basically a consolidation of the other bordawekar papers, in more detail. So he covers an experimental analysis of the Touchstone Delta; the problems arising from the direct-access model for non-conforming distributions; the two-phase model; and the run-time library to support two-phase access. See also bordawekar:reorganize, thakur:runtime, bordawekar:efficient, thakur:out-of-core, delrosario:two-phase, bordawekar:primitives, bordawekar:delta-fs.

bordawekar:placement:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. A framework for integrated communication and I/O placement. In Proceedings of the 2nd International Euro-Par'96, Parallel Processing, volume 1124 of Lecture Notes in Computer Science, pages 541-552. Springer-Verlag, August 1996.
See also earlier version bordawekar:placement-tr.

Abstract: This paper describes a framework for analyzing dataflow within an out-of-core parallel program. Dataflow properties of the FORALL statement are analyzed and a unified I/O and communication placement framework is presented. This placement framework can be applied to many problems, including eliminating redundant I/O incurred in communication. The framework is validated by applying it to optimize I/O and communication in out-of-core stencil problems. Experimental performance results on an Intel Paragon show significant reduction in I/O and communication overhead.

Keywords: parallel I/O, compiler, pario-bib

bordawekar:placement-tr:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. A framework for integrated communication and I/O placement. Technical Report CACR-117, Scalable I/O Initiative, Center for Advanced Computing Research, California Institute of Technology, February 1996.
See also later version bordawekar:placement.

Abstract: In this paper, we describe a framework for optimizing communication and I/O costs in out-of-core problems. We focus on communication and I/O optimization within a FORALL construct. We show that existing frameworks do not extend directly to out-of-core problems and cannot exploit the FORALL semantics. We present a unified framework for the placement of I/O and communication calls and apply it to optimize communication for stencil applications. Using the experimental results, we demonstrate that correct placement of I/O and communication calls can completely eliminate extra file I/O from communication and obtain significant performance improvement.

Keywords: parallel I/O, compiler, pario-bib

bordawekar:primitives:
Rajesh Bordawekar, Juan Miguel del Rosario, and Alok Choudhary. Design and evaluation of primitives for parallel I/O. In Proceedings of Supercomputing '93, pages 452-461, Portland, OR, 1993. IEEE Computer Society Press.

Abstract: In this paper, we show that the performance of parallel file systems can vary greatly as a function of the selected data distributions, and that some data distributions cannot be supported. Also, we describe how the parallel language extensions, though simplifying the programming, do not address the performance problems found in parallel file systems.

We have devised an alternative scheme for conducting parallel I/O - the Two-Phase Access Strategy - which guarantees higher and more consistent performance over a wider spectrum of data distributions. We have designed and implemented runtime primitives that make use of the two-phase access strategy to conduct parallel I/O, and facilitate the programming of parallel I/O operations. We describe these primitives in detail and provide performance results which show that I/O access rates are improved by up to several orders of magnitude. Further, we show that the variation in performance over various data distributions is restricted to within a factor of 2 of the best access rate.

Keywords: parallel I/O, pario-bib

Comment: Much of this is the same as delrosario:two-phase, except for section 4 where they describe their actual run-time library of primitives, with a little bit about how it works. It's not clear, for example, how their meta-data structures are distributed across the machine. They also do not describe their methods for the data redistribution.
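
The two-phase strategy is simple enough to sketch. Below is a minimal, hypothetical C version of a two-phase read, with MPI standing in for the Touchstone Delta's native message passing; the file name, sizes, and target cyclic distribution are assumptions, and the sketch requires that P*P divide N. Phase one issues large contiguous reads that conform to the file layout; phase two redistributes the data in memory to the distribution the program actually wants.

    /* Two-phase read sketch: read conforming contiguous chunks, then
       redistribute in memory to a cyclic distribution.  Names and
       sizes are assumptions; error handling is minimal. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int p, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &p);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        const long N = 1L << 20;      /* doubles in file; P*P must divide N */
        const long L = N / P;         /* length of my contiguous chunk      */
        const long per_dest = L / P;  /* doubles sent to each process       */
        double *chunk  = malloc(L * sizeof(double));
        double *packed = malloc(L * sizeof(double));
        double *mine   = malloc(L * sizeof(double));
        long   *fill   = calloc(P, sizeof(long));

        /* Phase 1: conforming read -- one large sequential request. */
        FILE *f = fopen("data.bin", "rb");
        if (!f) MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(f, p * L * (long)sizeof(double), SEEK_SET);
        fread(chunk, sizeof(double), L, f);
        fclose(f);

        /* Pack by destination: global element p*L+j belongs to rank
           (p*L+j) % P under the desired cyclic distribution. */
        for (long j = 0; j < L; j++) {
            long d = (p * L + j) % P;
            packed[d * per_dest + fill[d]++] = chunk[j];
        }

        /* Phase 2: redistribute among processes in one collective. */
        MPI_Alltoall(packed, (int)per_dest, MPI_DOUBLE,
                     mine,   (int)per_dest, MPI_DOUBLE, MPI_COMM_WORLD);

        /* 'mine' now holds my cyclic elements, grouped by the rank
           that read them; a final local permutation would put them
           in canonical order. */
        free(chunk); free(packed); free(mine); free(fill);
        MPI_Finalize();
        return 0;
    }

The point is that phase one's requests match the file layout regardless of the distribution the program wants, so the file system sees a few large sequential requests instead of many small strided ones.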

bordawekar:reorganize:
Rajesh Bordawekar, Alok Choudhary, and Rajeev Thakur. Data access reorganizations in compiling out-of-core data parallel programs on distributed memory machines. Technical Report SCCS-622, NPAC, Syracuse, NY 13244, September 1994.
See also earlier version bordawekar:efficient.

Keywords: parallel I/O, compilation, pario-bib

Comment: Basically they give a case study of out-of-core matrix multiplication to emphasize that the compiler's choice of loop ordering and matrix distribution for in-core matmult is not a very good choice for out-of-core matmult, because it causes too much I/O. By reorganizing the data and the loops, they get much better performance. In this particular case there are known algorithms which they should have used. In general they make the point that the compiler should consider several organizations, and estimate their costs, before generating code. They don't propose anything more sophisticated than to try all the possible organizations.
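
A toy sketch of the exhaustive search the comment describes, with invented seek-time and bandwidth constants; the paper's actual candidate set and cost model are more elaborate.

    /* Enumerate candidate data-access organizations, estimate the
       I/O cost of each, and pick the cheapest.  Candidates and
       constants are illustrative, not the paper's. */
    #include <stdio.h>

    struct org { const char *name; double seeks; double bytes; };

    /* Estimated time = seeks * seek_time + bytes / bandwidth. */
    static double cost(struct org o)
    {
        return o.seeks * 0.01 + o.bytes / 50e6;   /* 10 ms, 50 MB/s */
    }

    int main(void)
    {
        double N = 4096, rows = 256;     /* assumed array and slab sizes */
        struct org cand[] = {
            { "row slabs",    N / rows,       N * N * 8 },  /* one seek per slab */
            { "column slabs", (N / rows) * N, N * N * 8 },  /* one seek per row  */
        };
        int ncand = sizeof cand / sizeof cand[0], best = 0;
        for (int i = 1; i < ncand; i++)  /* "try all the possibilities" */
            if (cost(cand[i]) < cost(cand[best]))
                best = i;
        printf("choose: %s\n", cand[best].name);
        return 0;
    }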

bordawekar:stencil:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. Automatic optimization of communication in compiling out-of-core stencil codes. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 366-373, Philadelphia, PA, May 1996. ACM Press.
See also earlier version bordawekar:stencil-tr.

Abstract: In this paper, we describe a technique for optimizing communication for out-of-core distributed memory stencil problems. In these problems, communication may require both inter-processor communication and file I/O. We show that in certain cases, extra file I/O incurred in communication can be completely eliminated by reordering in-core computations. The in-core computation pattern is decided by: (1) how the out-of-core data is distributed into in-core slabs (tiling) and (2) how the slabs are accessed. We show that a compiler using the stencil and processor information can choose the tiling parameters and schedule the tile accesses so that the extra file I/O is eliminated and overall performance is improved.

Keywords: compiler, parallel I/O, pario-bib

bordawekar:stencil-tr:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. Automatic optimization of communication in out-of-core stencil codes. Technical Report CACR-114, Scalable I/O Initiative, Center for Advanced Computing Research, California Institute of Technology, November 1995.
See also later version bordawekar:stencil.

Abstract: In this paper, we describe a technique for optimizing communication for out-of-core distributed memory stencil problems. In these problems, communication may require both inter-processor communication and file I/O. We show that in certain cases, extra file I/O incurred in communication can be completely eliminated by reordering in-core computations. The in-core computation pattern is decided by: (1) how the out-of-core data is distributed into in-core slabs (tiling) and (2) how the slabs are accessed. We show that a compiler using the stencil and processor information can choose the tiling parameters and schedule the tile accesses so that the extra file I/O is eliminated and overall performance is improved.

Keywords: compiler, parallel I/O, pario-bib

bordawekar:support:
Rajesh Bordawekar and Alok Choudhary. Compiler and runtime support for parallel I/O. In Proceedings of IFIP Working Conference (WG10.3) on Programming Environments for Massively Parallel Distributed Systems, Monte Verita, Ascona, Switzerland, April 1994. Birkhaeuser Verlag AG, Basel, Switzerland.

Keywords: parallel I/O, pario-bib

Comment: Contains much of the material from bordawekar:hpf.

bordawekar:thesis:
Rajesh Bordawekar. Techniques for Compiling I/O Intensive Parallel Programs. PhD thesis, Electrical and Computer Engineering Dept., Syracuse University, April 1996. Also available as Caltech technical report CACR-118.

Abstract: This dissertation investigates several issues in providing compiler support for I/O intensive parallel programs. In this dissertation, we focus on satisfying two I/O requirements, namely, support for accessing multidimensional arrays and support for out-of-core computations. We analyze working spaces in I/O intensive programs and propose three execution models to be used by users or compilers for developing efficient I/O intensive parallel programs. Different phases in compiling out-of-core parallel programs are then described. Three different methods for performing communication are presented and validated using representative application templates. We illustrate that communication in out-of-core programs may require both inter-processor communication and file I/O. We show that using the copy-in-copy-out semantics of the HPF FORALL construct, extra file I/O incurred in communication can be completely eliminated by reordering in-core computations. Two different approaches for reordering in-core computations are presented, namely, an integrated tiling and scheduling heuristic, and a dataflow framework for placing communication and I/O calls. The discussion is supplemented with experimental performance results of representative stencil applications. Finally, an overview of the prototype PASSION (Parallel And Scalable Software for I/O) compiler is presented. This compiler takes an annotated out-of-core High Performance Fortran (HPF) program as input and generates the corresponding node+message-passing program with calls to the parallel I/O runtime library. We illustrate various functionalities of the compiler using example programs and supplement them with experimental results.

Keywords: parallel I/O, compiler, HPF, pario-bib

bornstein:reshuffle:
C. Bornstein and P. Steenkiste. Data reshuffling in support of fast I/O for distributed-memory machines. In Proceedings of the Third IEEE International Symposium on High Performance Distributed Computing, pages 227-235, August 1994.

Keywords: parallel I/O, distributed memory, pario-bib

Comment: In a sense, this is about a two-phase technique for network I/O. They consider the problem of feeding a fast network interface (HIPPI) from a distributed-memory parallel machine (iWARP) in which the individual internal links are slower than the external network. So they get the processors to cooperate to reshuffle the data into a canonical layout that is convenient to send to the gateway node, and from there onto the external network.

braam:lustre-arch:
Peter J. Braam. The Lustre storage architecture. Cluster File Systems, Inc. Architecture, design, and manual for Lustre, November 2002. http://www.lustre.org/docs/lustre.pdf.

Keywords: object-based storage, distributed file system, parallel file system, pario-bib

Comment: Describes an open-source project to develop an object-based file system for clusters. Related to the NASD project at CMU (http://www.pdl.cs.cmu.edu/NASD/).

bradley:ipsc2io:
David K. Bradley and Daniel A. Reed. Performance of the Intel iPSC/2 input/output system. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 141-144, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: hypercube, parallel I/O, Intel, pario-bib

Comment: Some measurements and simulations of early CFS performance. Looks terrible, but they disclaim that it is a beta version of the first CFS. They determined that the disks are the bottleneck. But this may just imply that they need more disks. Their parallel synthetic applications had each process read a separate file. CFS had ridiculous traffic overhead. Again, this was beta CFS.

brandwijn:dasd:
Alexandre Brandwajn. Performance benefits of parallelism in cached DASD controllers. Technical Report UCSC-CRL-88-30, Computer Research Laboratory, UC Santa Cruz, November 1988.

Keywords: parallel I/O, disk caching, disk architecture, pario-bib

Comment: Some new DASD products with caches overlap cache hits with prefetch of the remainder of the track into the cache. They use an analytical model to evaluate the performance of these products, and find performance improvements of 5-15 percent under their assumptions.

brezany:HPF:
Peter Brezany, Michael Gerndt, Piyush Mehrotra, and Hans Zima. Concurrent file operations in a High Performance FORTRAN. In Proceedings of Supercomputing '92, pages 230-237, 1992.

Keywords: supercomputing, fortran, multiprocessor file system interface, pario-bib

Comment: Describes their way of writing arrays to files so that they are written in a fast, parallel way, and so that (if read in the same distribution) they can be read back just as fast and in parallel. Normal read and write force standard ordering, but cread and cwrite use a compiler- and runtime-selected ordering, which is stored in the file so it can be used when rereading. Good for temp files.

brezany:HPF2:
Peter Brezany, Jonghyun Lee, and Marianne Winslett. Parallel I/O support for HPF on computational grids. In Proceedings of the Fourth International Symposium on High Performance Computing, volume 2327 of Lecture Notes in Computer Science, pages 539-550. Springer-Verlag, May 2002.

Abstract: Recently several projects have started to implement large-scale high-performance computing on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual supercomputer". One of the great challenges for this environment is to provide appropriate high-level programming models. High Performance Fortran (HPF) is a language of choice for development of data parallel components of Grid applications. Another challenge is to provide efficient access to data that is distributed across local and remote Grid resources. In this paper, constructs to specify parallel input and output (I/O) operations on multidimensional arrays on the Grid in the context of HPF are proposed. The paper also presents implementation concepts that are based on the HPF compiler VFC, the parallel I/O runtime system Panda, Internet, and Grid technologies. Preliminary experimental performance results are discussed in the context of a real application example.

Keywords: parallel I/O, Fortran, HPF, data-parallel, computational grid, pario-bib

brezany:architecture:
Peter Brezany, Thomas A. Mueck, and Erich Schikuta. A software architecture for massively parallel input-output. In Third International Workshop PARA'96 (Applied Parallel Computing - Industrial Computation and Optimization), volume 1186 of Lecture Notes in Computer Science, pages 85-96, Lyngby, Denmark, August 1996. Springer-Verlag. Also available as Technical Report of the Inst. f. Angewandte Informatik u. Informationssysteme, University of Vienna, TR 96202.

Abstract: For an increasing number of data intensive scientific applications, parallel I/O concepts are a major performance issue. Tackling this issue, we provide an outline of an input/output system designed for highly efficient, scalable and conveniently usable parallel I/O on distributed memory systems. The main focus of this paper is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of High Performance FORTRAN efforts. Specifically, our design is presented in the context of the Vienna Fortran Compilation System.

Keywords: compiler transformations, runtime support, parallel I/O, prefetching, pario-bib

brezany:compiling:
Peter Brezany, Thomas A. Mueck, and Erich Schikuta. Mass storage support for a parallelizing compilation system. In International Conference Eurosim'96 - HPCN challenges in Telecomp and Telecom: Parallel Simulation of Complex Systems and Large Scale Applications, pages 63-70, Delft, The Netherlands, June 1996. North-Holland, Elsevier Science.

Keywords: parallel I/O, high performance mass storage system, high performance languages, compilation techniques, data administration, pario-bib

brezany:io-support:
Peter Brezany, Thomas A. Mueck, and Erich Schikuta. Language, compiler and parallel database support for I/O intensive applications. In Proceedings of the International Conference on High Performance Computing and Networking, volume 919 of Lecture Notes in Computer Science, pages 14-20, Milan, Italy, May 1995. Springer-Verlag. Also available as Technical Report TR95-8, Inst. f. Software Technology and Parallel Systems, University of Vienna, 1995.

Keywords: compiler transformations, runtime support, declustering, parallel I/O, pario-bib

Comment: They describe some extensions to Vienna Fortran that support parallel I/O, and how they plan to extend the compiler and run-time system to help. They are somewhat short on details, however. The basic idea is that file declustering is based on hints from the compiler or programmer about how the file will be used, e.g., as a matrix distributed in thus-and-so way.

brezany:irregular:
P. Brezany, A. Choudhary, and M. Dang. Parallelization of irregular out-of-core applications for distributed-memory systems. High-Performance Computing and Networking, 1225:811-820, 1997.
See also earlier version brezany:irregular-tr.

Abstract: Large scale irregular applications involve data arrays and other data structures that are too large to fit in main memory and hence reside on disks; such applications are called out-of-core applications. This paper presents techniques for implementing this kind of application. In particular we present a design for a runtime system to efficiently support parallel execution of irregular out-of-core codes on distributed-memory systems. Furthermore, we describe the appropriate program transformations required to reduce the I/O overheads for staging data as well as for communication while maintaining load balance. The proposed techniques can be used by a parallelizing compiler or by users writing programs in node + message passing style. We have done a preliminary implementation of the techniques presented here. We present experimental results from a template CFD code to demonstrate the efficacy of the presented techniques.

Keywords: parallel I/O, out of core, compiler, library, pario-bib

Comment: The authors present techniques for implementing large scale irregular out-of-core applications. The techniques they describe can be used by a parallelizing compiler (e.g., HPF and its extensions) or by users writing message-passing programs. The objectives of the proposed techniques are ''to minimize I/O accesses in all steps while maintaining load balance and minimal communication''. They demonstrate the effectiveness of their techniques by showing results from a Computational Fluid Dynamics (CFD) code.

brezany:irregular-tr:
P. Brezany and A. Choudhary. Techniques and optimizations for developing irregular out-of-core applications on distributed-memory systems. Technical Report 96-4, Institute for Software Technology and Parallel Systems, University of Vienna, November 1996.

Keywords: parallel I/O, out of core, irregular applications, compiler, pario-bib

brezany:technology:
Peter Brezany, Marianne Winslett, Denis A. Nicole, and Toni Cortes. Parallel I/O and storage technology. In Proceedings of the Seventh International Euro-Par Conference, volume 2150 of Lecture Notes in Computer Science, pages 887-888, Manchester, UK, August 2001. Springer-Verlag.

Abstract: Input and output (I/O) is a major performance bottleneck for large-scale scientific applications running on parallel platforms. For example, it is not uncommon for the performance of carefully tuned parallel programs to drop dramatically when they read or write files. This is because many parallel applications need to access large amounts of data, and although great advances have been made in the CPU and communication performance of parallel machines, similar advances have not been made in their I/O performance. The densities and capacities of disks have increased significantly, but improvement in the performance of individual disks has not kept pace. For parallel computers to be truly usable for solving real, large-scale problems, the I/O performance must be scalable and balanced with respect to the CPU and communication performance of the system. Parallel I/O techniques can help to solve this problem by creating multiple data paths between memory and disks. However, simply adding disk drives to an I/O system without considering the overall software design will improve performance only marginally.

Keywords: pario-bib, parallel I/O

broom:acacia:
Bradley M. Broom. A synchronous file server for distributed file systems. In Proceedings of the 16th Australian Computer Science Conference, 1993.
See also earlier version broom:acacia-tr.

Keywords: distributed file system, pario-bib

Comment: See broom:acacia-tr. See also broom:impl, lautenbach:pfs, mutisya:cache, and broom:cap.

broom:acacia-tr:
Bradley M. Broom. A synchronous file server for distributed file systems. Technical Report TR-CS-92-12, Dept. of Computer Science, Australian National University, August 1992.
See also later version broom:acacia.

Keywords: distributed file system, pario-bib

Comment: This paper is not specifically about parallel I/O, but the file system will be used in the AP-1000 multiprocessor. Acacia is a file server that is optimized for synchronous writes, like those used in stateless protocols (e.g., NFS). It writes inodes in blocks in any free location that is close to the current head position, using indirect inode blocks to track those. Indirect blocks are in turn written anywhere convenient, and their positions are tracked by the superblock. There is one slot in each cylinder reserved for the superblock, which is timestamped. They get good performance but claim to need a better implementation, and a faster allocation algorithm. No indication of effect on read performance.
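
A toy sketch of the allocation policy just described: place each write in the free block nearest the current head position. The bitmap representation and distance metric are assumptions, not Acacia's actual data structures.

    /* Toy allocator: scan outward from the head position and take
       the first free block found.  All structures are invented. */
    #include <stdio.h>

    #define NBLOCKS 100000

    static unsigned char free_map[NBLOCKS];   /* 1 = block is free  */
    static long head_pos = 50000;             /* current head block */

    /* Return the free block closest to the head, or -1 if none. */
    static long alloc_near_head(void)
    {
        for (long d = 0; d < NBLOCKS; d++) {
            long lo = head_pos - d, hi = head_pos + d;
            if (lo >= 0 && free_map[lo]) { free_map[lo] = 0; return head_pos = lo; }
            if (hi < NBLOCKS && free_map[hi]) { free_map[hi] = 0; return head_pos = hi; }
        }
        return -1;
    }

    int main(void)
    {
        free_map[49000] = free_map[52000] = 1;  /* two free blocks    */
        printf("%ld\n", alloc_near_head());     /* 49000 (the nearer) */
        printf("%ld\n", alloc_near_head());     /* 52000              */
        return 0;
    }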

broom:cap:
Bradley M. Broom and Robert Cohen. Acacia: A distributed, parallel file system for the CAP-II. In Proceedings of the First Fujitsu-ANU CAP Workshop, November 1990.

Keywords: distributed file system, multiprocessor file system, pario-bib

Comment: See also broom:acacia, broom:impl, lautenbach:pfs, and mutisya:cache. This describes the semantic model for their file system. Modelled a lot after Amoeba, they have capabilities that represent immutable files. There are create, destroy, read, and write operations, but the read and write can affect only part of the file, if desired. They also have an atomic ``copy'' operation, which creates a snapshot of the current state of the file. They also have ``spawn'' and ``merge'' operations, which are essentially begin and end a transaction, a set of changes that are atomically merged into the file later. These seem to be addressing issues of concurrency more than of parallelism. They also discuss implementation somewhat, mentioning the use of distributed caches and log-structured disk layout. Prototype in Linda (!).

broom:impl:
Bradley M. Broom. Implementation and performance of the Acacia file system. In Proceedings of the Second Fujitsu-ANU CAP Workshop, November 1991.

Keywords: distributed file system, multiprocessor file system, pario-bib

Comment: See also broom:acacia, lautenbach:pfs, mutisya:cache, and broom:cap. This paper is a very sketchy overview of those; it is better to read them.

broom:kelpio:
Bradley Broom, Rob Fowler, and Ken Kennedy. KelpIO: A telescope-ready domain-specific I/O library for irregular block-structured applications. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 148-155, Brisbane, Australia, May 2001. IEEE Computer Society Press.

Abstract: To ameliorate the need to spend significant programmer time modifying parallel programs to achieve high performance, while maintaining compact, comprehensible source codes, the paper advocates the use of telescoping language technology to automatically apply, during the normal compilation process, high-level performance-enhancing transformations to applications using a high-level domain-specific I/O library. We believe that this approach will be more acceptable to application developers than new language extensions, but will be just as amenable to optimization by advanced compilers, effectively making it a domain-specific language extension for I/O. The paper describes a domain-specific I/O library for irregular block-structured applications based on the KeLP library, describes high-level transformations of the library primitives for improving performance, and describes how a high-level domain-specific optimizer for applying these transformations could be constructed using the telescoping languages framework.

Keywords: parallel I/O, domain-specific I/O library, scientific computing, astronomy, pario-bib

broom:perf:
Bradley M. Broom. Performance measurement of the Acacia parallel file system for the AP1000 multicomputer. In Proc. Second Parallel Computing Workshop, pages P1-F-1 to P1-F-11, Kawasaki, Japan, November 1993. Fujitsu Parallel Computing Research Facilities, Fujitsu Laboratories Ltd.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: They evaluate the performance of Acacia with some simple synthetic benchmarks. Performance limited by implementation problems in the sequential file system. Otherwise no real surprises.

brown:benchmarks:
Aaron Brown and David A. Patterson. Towards availability benchmarks: A case study of software RAID systems. In Proceedings of the 2000 USENIX Technical Conference, pages 263-276. USENIX Association, 2000.

Abstract: Benchmarks have historically played a key role in guiding the progress of computer science systems research and development, but have traditionally neglected the areas of availability, maintainability, and evolutionary growth, areas that have recently become critically important in high-end system design. As a first step in addressing this deficiency, we introduce a general methodology for benchmarking the availability of computer systems. Our methodology uses fault injection to provoke situations where availability may be compromised, leverages existing performance benchmarks for workload generation and data collection, and can produce results in both detail-rich graphical presentations or in distilled numerical summaries. We apply the methodology to measure the availability of the software RAID systems shipped with Linux, Solaris 7 Server, and Windows 2000 Server, and find that the methodology is powerful enough not only to quantify the impact of various failure conditions on the availability of these systems, but also to unearth their design philosophies with respect to transient errors and recovery policy.

Keywords: RAID, disk array, parallel I/O, pario-bib

browne:io-arch:
J. C. Browne, A. G. Dale, C. Leung, and R. Jenevein. A parallel multi-stage I/O architecture with self-managing disk cache for database management applications. In Proceedings of the Fourth International Workshop on Database Machines. Springer-Verlag, March 1985.

Keywords: parallel I/O, disk caching, database, pario-bib

Comment: A fancy interconnection from procs to I/O processors, intended mostly for DB applications, that uses cache at I/O end and a switch with smarts. Cache is associative. Switch helps out in sort and join operations.

bruce:chimp:
R. A. A. Bruce, S. R. Chapple, N. B. MacDonald, and A. S. Trew. CHIMP and PUL: Support for portable parallel programming. Technical Report EPCC-TR93-07, Edinburgh Parallel Computing Center, March 1993.

Keywords: parallel programming, parallel I/O, pario-bib

Comment: An overview of the CHIMP message-passing library and the PUL set of libraries. Key design goal is portability; they run on many systems. PUL includes PUL-GF, which supports parallel access to files (see chapple:pulgf, chapple:pulgf-adv, and chapple:pario). Other PUL libraries support grids and meshes, global communications, and task farms. Contact pul@epcc.ed.ac.uk.

brunet:factor:
Jean-Philippe Brunet, Palle Pedersen, and S. Lennart Johnsson. Load-balanced LU and QR factor and solve routines for scalable processors with scalable I/O. In Proceedings of the 17th IMACS World Congress, Atlanta, GA, July 1994. Also available as Harvard University Computer Science Technical Report TR-20-94.

Abstract: The concept of block-cyclic order elimination can be applied to out-of-core LU and QR matrix factorizations on distributed memory architectures equipped with a parallel I/O system. This elimination scheme provides load balanced computation in both the factor and solve phases and further optimizes the use of the network bandwidth to perform I/O operations. Stability of LU factorization is enforced by full column pivoting. Performance results are presented for the Connection Machine system CM-5.

Keywords: parallel I/O, linear algebra, out-of-core, pario-bib

Comment: Short, not many details. Performance results shows about 3.5 Gflops for all problem sizes, both in-core on small N and out-of-core on large N.

cabrera:pario:
Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405-436, Fall 1991.
See also earlier version cabrera:pariotr.

Keywords: parallel I/O, disk striping, distributed file system, pario-bib

Comment: See cabrera:swift, cabrera:swift2. Describes the performance of a Swift prototype and simulation results. They stripe data over multiple disk servers (here SPARC SLC with local disk), and access it from a SPARC2 client. Their prototype gets nearly linear speedup for reads and asynchronous writes; synchronous writes are slower. They hit the limit of the Ethernet and/or the client processor with three disk servers. Adding another Ethernet allowed them to go higher. Simulation shows good scaling. Seems like a smarter implementation would help, as would special-purpose parity-computation hardware. Good arguments for use of PID instead of RAID, to avoid a centralized controller that is both a bottleneck and a single point of failure.
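
The address mapping underlying any such striped layout is worth seeing once; a sketch with arbitrary parameters (not Swift's actual stripe unit or server count):

    /* Map a logical file offset to (server, offset-within-server)
       under round-robin striping.  The constants are examples. */
    #include <stdio.h>

    #define STRIPE_UNIT 65536L   /* bytes per stripe unit  */
    #define NSERVERS    8L       /* number of disk servers */

    int main(void)
    {
        long off    = 1234567;                 /* logical file offset */
        long unit   = off / STRIPE_UNIT;       /* which stripe unit   */
        long server = unit % NSERVERS;         /* which disk server   */
        long local  = (unit / NSERVERS) * STRIPE_UNIT + off % STRIPE_UNIT;
        printf("offset %ld -> server %ld, local offset %ld\n",
               off, server, local);
        return 0;
    }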

cabrera:pariotr:
Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Technical Report CRL-91-46, UC Santa Cruz, 1991.
See also later version cabrera:pario.

Keywords: parallel I/O, disk striping, distributed file system, pario-bib

cabrera:stripe:
Luis-Felipe Cabrera and Darrell D. E. Long. Using data striping in a local area network. Technical Report UCSC-CRL-92-09, UC Santa Cruz, March 1992.

Keywords: striping, parallel I/O, distributed system, pario-bib

Comment: See cabrera:swift2, cabrera:swift, cabrera:pario. Not much new here. Simulates higher-performance architectures. Shows reasonable scalability. Counts 5 inst/byte for parity computation.

cabrera:swift:
Luis-Felipe Cabrera and Darrell D. E. Long. Swift: A storage architecture for large objects. Technical Report UCSC-CRL-89-04, UC Santa Cruz, 1990.
See also later version cabrera:swift2.

Keywords: parallel I/O, disk striping, distributed file system, multimedia, pario-bib

Comment: See cabrera:swift2. A brief outline of a design for a high-performance storage system, designed for storing and retrieving large objects like color video or visualization data at very high speed. They distribute data over several ``storage agents'', which are some form of disk or RAID. They are all connected by a high-speed network. A ``storage manager'' decides where to spread each file, what kind of reliability mechanism is used. User provides preallocation info such as size, reliability level, data rate requirements, and so forth.

cabrera:swift2:
Luis-Felipe Cabrera and Darrell D. E. Long. Exploiting multiple I/O streams to provide high data-rates. In Proceedings of the 1991 Summer USENIX Technical Conference, pages 31-48, 1991.
See also earlier version cabrera:swift.

Keywords: parallel I/O, disk striping, distributed file system, multimedia, pario-bib

Comment: See also cabrera:swift. More detail than the other paper. Experimental results from a prototype that stripes files across a distributed file system. Gets almost linear speedup in certain cases. Much better than NFS. Simulation to extend it to larger systems.

calderon:implement:
Alejandro Calderón, Félix García, Jesús Carretero, José M. Pérez, and Javier Fernández. An implementation of MPI-IO on Expand: A parallel file system based on NFS servers. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 2474 of Lecture Notes in Computer Science, pages 306-313. Springer-Verlag, 2002.

Abstract: This paper describes an implementation of MPI-IO using a new parallel file system, called Expand (Expandable Parallel File System), that is based on NFS servers. Expand combines multiple NFS servers to create a distributed partition where files are declustered. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and the NFS protocol. The paper describes the design, the implementation and the evaluation of Expand with MPI-IO. This evaluation has been made in Linux clusters and compares Expand and PVFS.

Keywords: parallel I/O, multiprocessor file system, NFS, pario-bib

cannataro:data-intensive:
Mario Cannataro, Domenico Talia, and Pradip K. Srimani. Parallel data intensive computing in scientific and commercial applications. Parallel Computing, 28(5):673-704, May 2002.

Abstract: Applications that explore, query, analyze, visualize, and, in general, process very large scale data sets are known as Data Intensive Applications. Large scale data intensive computing plays an increasingly important role in many scientific activities and commercial applications, whether it involves data mining of commercial transactions, experimental data analysis and visualization, or intensive simulation such as climate modeling. By combining high performance computation, very large data storage, high bandwidth access, and high-speed local and wide area networking, data intensive computing enhances the technical capabilities and usefulness of most systems. The integration of parallel and distributed computational environments will produce major improvements in performance for both computing intensive and data intensive applications in the future. The purpose of this introductory article is to provide an overview of the main issues in parallel data intensive computing in scientific and commercial applications and to encourage the reader to go into the more in-depth articles later in this special issue.

Keywords: parallel application, parallel I/O, pario-bib

cao:jtickertaip:
Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The TickerTAIP parallel RAID architecture. ACM Transactions on Computer Systems, 12(3):236-269, August 1994.
See also earlier version cao:tickertaip.

Keywords: parallel I/O, RAID, pario-bib

Comment: See cao:tickertaip-tr2.

cao:tickertaip:
Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The TickerTAIP parallel RAID architecture. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 52-63, 1993.
See also earlier version cao:tickertaip-tr2.
See also later version cao:jtickertaip.

Keywords: parallel I/O, RAID, pario-bib

Comment: Superseded by cao:tickertaip-tr2 and cao:jtickertaip.

cao:tickertaip-tr:
Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The TickerTAIP parallel RAID architecture. Technical Report HPL-92-151, HP Labs, December 1992.
See also later version cao:tickertaip-tr2.

Keywords: parallel I/O, RAID, pario-bib

Comment: A parallelized RAID architecture that distributes the RAID controller operations across several worker nodes. Multiple hosts can connect to different workers, allowing multiple paths into the array. The workers then communicate on their own fast interconnect to accomplish the requests, distributing parity computations across multiple workers. They get much better performance and reliability than plain RAID. They built a prototype and a performance simulator. Two-phase commit was needed for request atomicity, and a request sequencer was needed for serialization. Also found it was good to give the whole request info to all workers and to let them figure out what to do and when. Superseded by cao:tickertaip-tr2 and cao:tickertaip.
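
For reference, the parity work TickerTAIP distributes across its workers is ordinary RAID-style XOR arithmetic; a generic sketch of the small-write parity update (not TickerTAIP's code):

    /* RAID-5-style parity: parity = XOR of the data blocks, and a
       small write can update it as
       new_parity = old_parity XOR old_data XOR new_data. */
    #include <stdio.h>
    #include <string.h>

    #define BLK 8   /* tiny block size, for illustration only */

    static void update_parity(unsigned char *parity,
                              const unsigned char *old_data,
                              const unsigned char *new_data)
    {
        for (int i = 0; i < BLK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }

    int main(void)
    {
        unsigned char d0[BLK] = "AAAAAAA", d1[BLK] = "BBBBBBB", parity[BLK];
        for (int i = 0; i < BLK; i++) parity[i] = d0[i] ^ d1[i];

        unsigned char d0new[BLK] = "CCCCCCC";
        update_parity(parity, d0, d0new);   /* update, don't recompute */
        memcpy(d0, d0new, BLK);

        for (int i = 0; i < BLK; i++)       /* parity == d0 XOR d1 again */
            if (parity[i] != (unsigned char)(d0[i] ^ d1[i])) return 1;
        printf("parity consistent\n");
        return 0;
    }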

cao:tickertaip-tr2:
Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The TickerTAIP parallel RAID architecture. Technical Report HPL-93-25, HP Labs, April 1993.
See also earlier version cao:tickertaip-tr.
See also later version cao:tickertaip.

Keywords: parallel I/O, RAID, pario-bib

Comment: Revised version of cao:tickertaip, actually: ``It's the ISCA paper with some text edits plus some new results on what happens if you turn disk request-scheduling on. It's been sent to TOCS.'' Thus it supersedes both cao:tickertaip-tr and cao:tickertaip. Eventually published as cao:jtickertaip.

carballeira:adaptive:
Félix García-Carballeira, Jesús Carretero, Alejandro Calderón, José M. Pérez, and José D. García. An adaptive cache coherence protocol specification for parallel input/output systems. IEEE Transactions on Parallel and Distributed Systems, 15(6):533-545, June 2004.

Abstract: Caching has been intensively used in memory and traditional file systems to improve system performance. However, the use of caching in parallel file systems and I/O libraries has been limited to I/O nodes to avoid cache coherence problems. In this paper, we specify an adaptive cache coherence protocol very suitable for parallel file systems and parallel I/O libraries. This model exploits the use of caching, both at processing and I/O nodes, providing performance-increasing mechanisms such as aggressive prefetching and delayed-write techniques. The cache coherence problem is solved by using a dynamic scheme of cache coherence protocols with different sizes and shapes of granularity. The proposed model is very appropriate for parallel I/O interfaces, such as MPI-IO. Performance results, obtained on an IBM SP2, are presented to demonstrate the advantages offered by the cache management methods proposed.

Keywords: parallel file system, caching, cache coherence, adaptive caching, protocol specification, pario-bib

carey:shore:
Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffrey F. Naughton, Daniel T. Schuh, Marvin H. Solomon, C. K. Tan, Odysseas G. Tsatalos, Seth J. White, and Michael J. Zwilling. Shoring up persistent applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 383-394. ACM Press, 1994.

Keywords: persistent systems, database, parallel I/O, object-oriented, pario-bib

Comment: SHORE is a persistent object database system. It is intended for parallel or distributed systems, and attempts to combine both DB and file system features. Everything in the database is a typed object, in that there is a registered interface object that defines this type, including the basic data types of elements of the object, and methods that manipulate the object. Every object has an OID, and objects can refer to other objects with the OID. But they also support unix-like namespace, in which the names refer to objects by giving the OID. They also have a unix-compatibility library that provides access to many objects through the unix file interface. Every node has a SHORE server, and applications talk to their local server for all their needs. The local server talks to other servers as needed. The servers are also responsible for caching pages and managing locks and transactions.

carns:pvfs:
Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317-327, Atlanta, GA, October 2000. USENIX Association.

Abstract: As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters.

In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both for a concurrent read/write workload and for the BTIO benchmark. We compare the I/O performance when using a Myrinet network versus a fast-ethernet network for I/O-related communication in PVFS. We obtained read and write bandwidths as high as 700 Mbytes/sec with Myrinet and 225 Mbytes/sec with fast ethernet.

Keywords: parallel I/O, parallel file system, cluster file system, Linux, pario-bib

Comment: Won the Best Paper Award.

carretero:case:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. Implementation of a parallel file system: CCFS a case of study. Technical Report FIM/84.1/DATSI/94, Universidad Politécnica de Madrid, Madrid, Spain, 1994.

Abstract: This document briefly describes the components of the Cache Coherent File System (CCFS) source code. CCFS has three main components: the Client File Server (CLFS), the Local File Server (LFS), and the Concurrent Disk System (CDS). The main modules and functions of each component are described here. Special emphasis has been put on interfaces and data structures.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:compassion:
J. Carretero, J. No, S.-S. Park, A. Choudhary, and P. Chen. Compassion: a parallel I/O runtime system including chunking and compression for irregular applications. In Proceedings of the International Conference on High-Performance Computing and Networking, pages 668-677, April 1998.
See also later version carretero:compassion2.

Abstract: We present two designs, namely, "collective I/O" and "pipelined collective I/O", of a runtime library for irregular applications based on the two-phase collective I/O technique. We also present the optimization of both models by using chunking and compression mechanisms. In the first scheme, all processors participate in compression and I/O at the same time, making scheduling of I/O requests simpler but creating a possibility of contention at the I/O nodes. In the second approach, processors are grouped into several groups, overlapping communication, compression, and I/O to reduce I/O contention dynamically. Finally, evaluation results are shown that demonstrate that we can obtain significantly higher I/O performance than has been possible so far.

Keywords: PASSION, parallel I/O, compression, collective I/O, two-phase I/O, performance evaluation, pario-bib

carretero:compassion2:
J. Carretero, Jaechun No, A. Choudhary, and Pang Chen. COMPASSION: a parallel I/O runtime system including chunking and compression for irregular applications. In Proceedings of the Fifth International Symposium on Solving Irregularly Structured Problems in Parallel (IRREGULAR'98), pages 262-273, August 1998.
See also earlier version carretero:compassion.

Abstract: In this paper we present an experimental evaluation of COMPASSION, a runtime system for irregular applications based on collective I/O techniques. It provides a "Collective I/O" model, enhanced with "Pipelined" operations and compression. All processors participate in the I/O simultaneously, alone or grouped, making scheduling of I/O requests simpler and providing support for contention management. In-memory compression mechanisms reduce the total execution time by diminishing the amount of I/O requested and the I/O contention. Our experiments, executed on an Intel Paragon and on the ASCI/Red teraflops machine, demonstrate that COMPASSION can obtain significantly higher I/O performance than has been possible so far.

Keywords: PASSION, parallel I/O, compression, collective I/O, pario-bib

carretero:concepts:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. Multicomputer parallel file systems design concepts: CCFS a case of study. Technical Report FIM/79.1/DATSI/94, Universidad Politécnica de Madrid, Madrid, Spain, 1994.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:evaluation:
J. Carretero, F. Pérez, P. de Miguel, F. Garc\'\ia, and L. Alonso. A multiprocessor parallel disk system evaluation. Decentralized and Distributed Systems, September 1993. IFIP Transactions A-39.

Abstract: This paper presents a Parallel Disk System (PDS) for general purpose multiprocessors, which provides support for conventional file systems and databases, as well as direct access for applications requiring high performance mass storage. We present a systematic method to characterize a parallel I/O system, using it to evaluate PDS and to identify an optimal PDS configuration. Several devices (single disk, RAID-3, and RAID-5), and different configurations of I/O nodes, each with a different type of device, have been simulated. Throughput and I/O rate have been obtained for these configurations and different types of workloads (database, general purpose, and scientific applications).

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:lfs:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. LFS design: A parallel file server for multicomputers. Technical Report FIM/81.1/DATSI/94, Universidad Politécnica de Madrid, Madrid, Spain, 1994.

Abstract: This document describes the detailed design of the LFS, one of the components of the Cache Coherent File System (CCFS). CCFS has three main components: Client File Server (CLFS), Local File Server (LFS), Concurrent Disk System (CDS). The Local File Servers are located on each disk node, to develop file server functions in a per node basis. The LFS will interact with the Concurrent Disk System (CDS) to execute real input/output and to manage the disk system, partitions, distributed partitions, etc. The LFS includes general file system services and specialized services, and it will be responsible of maintaining cache consistency, distributing accesses to other servers, controlling partition information, etc.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:mapping:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. I/O data mapping in ParFiSys: support for high-performance I/O in parallel and distributed systems. In Euro-Par '96, volume 1123 of Lecture Notes in Computer Science, pages 522-526. Springer-Verlag, August 1996.

Abstract: This paper gives an overview of the I/O data mapping mechanisms of ParFiSys. Grouped management and parallelization are presented as relevant features. I/O data mapping mechanisms of ParFiSys, including all levels of the hierarchy, are described in this paper.

Keywords: parallel I/O, multiprocessor file system, pario-bib

carretero:parfisys:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. ParFiSys: A parallel file system for MPP. ACM Operating Systems Review, 30(2):74-80, April 1996.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:performance:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. Performance increase mechanisms for parallel and distributed file systems. Parallel Computing, 23(4):525-542, June 1997.

Keywords: parallel I/O, multiprocessor file system, pario-bib

carretero:posix:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. Prototype POSIX-style parallel file server and report for the CS-2. Technical Report D1.7/1, Universidad Politécnica de Madrid, Madrid, Spain, 1993.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:posix-final:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. POSIX-style parallel file server for the GPMIMD: Final report. Technical Report D1.7/2, Universidad Politécnica de Madrid, Madrid, Spain, 1995.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carretero:subsystem:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. A massively parallel and distributed I/O subsystem. Computer Architecture News, 24(3):1-8, June 1996.

Keywords: parallel I/O, I/O architecture, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

carter:benchmark:
Russell Carter, Bob Ciotti, Sam Fineberg, and Bill Nitzberg. NHT-1 I/O benchmarks. Technical Report RND-92-016, NAS Systems Division, NASA Ames, November 1992.

Keywords: parallel I/O, benchmark, pario-bib

Comment: Specs for three scalable-I/O benchmarks to be used for evaluating I/O for multiprocessors. One measures application I/O by mixing I/O and computation, one measures max disk I/O by reading and writing 80% of the total RAM memory, and the last one is for sending that data from the file system, through the network, and back. See fineberg:nht1.

carter:vesta:
Matthew P. Carter and David Kotz. An implementation of the Vesta parallel file system API on the Galley parallel file system. Technical Report PCS-TR98-329, Dept. of Computer Science, Dartmouth College, April 1998.

Abstract: To demonstrate the flexibility of the Galley parallel file system and to analyze the efficiency and flexibility of the Vesta parallel file system interface, we implemented Vesta's application-programming interface on top of Galley. We implemented the Vesta interface using Galley's file-access methods, whose design arose from extensive testing and characterization of the I/O requirements of scientific applications for high-performance multiprocessors. We used a parallel CPU, parallel I/O, out-of-core matrix-multiplication application to test the Vesta interface in both its ability to specify data access patterns and in its run-time efficiency. In spite of its powerful ability to specify the distribution of regular, non-overlapping data access patterns across disks, we found that the Vesta interface has some significant limitations. We discuss these limitations in detail in the paper, along with the performance results.

Keywords: parallel I/O, multiprocessor file system, pario-bib, dfk

Comment: See also nils/galley.html

catania:array:
V. Catania, A. Puliafito, S. Riccobene, and L. Vita. Performance evaluation of a partial dynamic declustering disk array system. In Proceedings of the Third IEEE International Symposium on High Performance Distributed Computing, pages 244-252, August 1994.

Abstract: With a view to improving the performance and the fault tolerance of mass storage units, this paper concentrates on the architectural issues of parallelizing I/O access in a disk array system by means of definition of a new, particularly flexible architecture, called Partial Dynamic Declustering, which is fault-tolerant and offers higher levels of performance and reliability than the solutions normally used. A fast distributed algorithm based on a dynamic structure and usable for the implementation of an efficient I/O subsystem manager is proposed. Particular attention is also paid to the definition of analytical models based on Stochastic Reward Petri nets in order to analyze the performance and reliability of the system proposed.

Keywords: parallel I/O, disk array, pario-bib

catania:disk-array:
V. Catania, A. Puliafito, S. Riccobene, and L. Vita. Design and performance analysis of a disk array system. IEEE Transactions on Computers, 44(10):1236-1247, October 1995.

Abstract: We concentrate on the architectural issues of parallelizing I/O access in a disk array system by means of definition of a new, particularly flexible architecture, called partial dynamic declustering, which is fault-tolerant and offers higher levels of performance and reliability than the solutions normally used. A simulation analysis highlights the efficiency of the proposed solution in balancing the file system workload and demonstrates its validity in both cases of unbalanced loads and expansion of the system. Particular attention is also paid to the definition of analytical models, based on stochastic reward nets, in order to analyze the performance and reliability of the system. The response time distribution function is evaluated and a specific performance analysis with varying degrees of declustering and workload is carried out.

Keywords: parallel I/O, disk array, pario-bib

catania:mass:
V. Catania, A. Puliafito, S. Riccobene, and L. Vita. An I/O subsystem supporting mass storage functions in parallel systems. Computer Standards & Interfaces, 18(2):117-138, 1996.

Abstract: The introduction of multiprocessor architectures into computer systems has further increased the gap between processing times and access times to mass memories, thus making the processes more and more I/O-bound. To provide higher performance levels (both transfer rate and I/O rate), disk array technology is based on the use of a number of logically interconnected disks of a small size, in order to replace disks which have a large capacity but are very expensive. With a view to improving the performance and fault tolerance of the mass storage units, this paper concentrates on the architectural issues of parallelizing I/O access in a disk array system by means of definition of a new, particularly flexible architecture, called Partial Dynamic Declustering, which is fault-tolerant and offers higher levels of performance and reliability than the solutions normally used. A fast distributed algorithm based on a dynamic structure and usable for the implementation of an efficient I/O subsystem manager is proposed and evaluated by a simulative analysis. A specific study also characterizes the system's performance with varying degrees of declustering and workload types (from the transactional to the scientific type). The results obtained allow us to obtain the optimal configuration of the system (number of disks per group) which will ensure the desired response time values for varying workloads.

Keywords: parallel I/O, pario-bib

cecchet:raidb:
Emmanuel Cecchet, Julie Marguerite, and Willy Zwaenepoel. Partial replication: Achieving scalability in redundant arrays of inexpensive databases. Lecture Notes in Computer Science, 3144:58-70, July 2004.

Abstract: Clusters of workstations become more and more popular to power data server applications such as large scale Web sites or e-Commerce applications. There has been much research on scaling the front tiers (web servers and application servers) using clusters, but databases usually remain on large dedicated SMP machines. In this paper, we focus on the database tier using clusters of commodity hardware. Our approach consists of studying different replication strategies to achieve various degrees of performance and fault tolerance. Redundant Array of Inexpensive Databases (RAIDb) is to databases what RAID is to disks. In this paper, we focus on RAIDb-1, which offers full replication, and RAIDb-2, which introduces partial replication, in which the user can define the degree of replication of each database table. We present a Java implementation of RAIDb called Clustered JDBC or C-JDBC. C-JDBC achieves both database performance scalability and high availability at the middleware level without changing existing applications. We show, using the TPC-W benchmark, that partial replication (RAIDb-2) can offer better performance scalability (up to 25%) than full replication by allowing fine-grain control on replication. Distributing and restricting the replication of frequently written tables to a small set of backends reduces I/O usage and improves CPU utilization of each cluster node.

Keywords: replication strategies, RAIDb, database, pario-bib
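
Comment: As a concrete reading of RAIDb-2's per-table replication degree, the sketch below gives a hypothetical replication map (backend names, tables, and counts are illustrative, not from the paper); RAIDb-1 would instead place every table on every backend.

    # Hypothetical RAIDb-2 partial-replication map: each table lives on a
    # chosen subset of backends, so read-mostly tables can be replicated
    # widely while frequently written tables are confined to a few
    # backends, cutting write fan-out. All names are illustrative.
    replication_map = {
        "items":      ["db0", "db1", "db2", "db3"],  # read-mostly: replicate widely
        "customers":  ["db0", "db1"],
        "orders":     ["db2"],                       # write-heavy: single backend
        "order_line": ["db3"],
    }

    # A write to "orders" touches one backend; under full replication
    # (RAIDb-1) it would have to be broadcast to all four.
    print(replication_map["orders"])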

cerin:sorting:
Christophe Cérin, Hazem Fkaier, and Mohamed Jemni. A synthesis of parallel out-of-core sorting programs on heterogeneous clusters. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 78-85, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: The paper considers the problem of parallel external sorting in the context of a form of heterogeneous clusters. We introduce two algorithms and we compare them with another one that we have previously developed. Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory. This is the case for cluster computing platforms which are made of standard, cheap and scarce components. For that class of computing resources, a good use of I/O operations compatible with the requirements of load balancing and computational complexity is the key to success. We explore three techniques and show how they can be deployed for clusters with processor performances related by a multiplicative factor. We validate the approaches by showing experimental results for the load balancing factor.

Keywords: out-of-core, sorting, parallel I/O, load balancing, data distribution, pario-app, pario-bib

ceron:dna:
C. Ceron, J. Dopazo, E. L. Zapata, J.M. Carazo, and O. Trelles. Parallel implementation of DNAml program on message-passing architectures. Parallel Computing, 24(5-6):701-716, June 1997.

Abstract: We present a new computing approach for the parallelization on message-passing computer architectures of the DNAml algorithm, one of the most powerful tools available for constructing phylogenetic trees from DNA sequences. An analysis of the data dependencies of the method gave little chances to develop an efficient parallel approach. However, a careful run-time analysis of the behaviour of the algorithm allowed us to propose a very efficient parallel implementation based on the combination of advanced dynamic scheduling strategies, speculative running-time execution decisions and I/O buffering. In this work, we discuss specific Parallel Virtual Machine (PVM)-based implementations for a cluster of workstations and for Distributed Memory multiprocessors, with high performance results. The code can be obtained from our public-domain sites.

Keywords: parallel computers, run-time analysis, phylogenetic trees, DNAml program, source code, parallel I/O, pario-bib

Comment: They discuss the parallelization on message-passing computers of the DNAml algorithm, a tool used to construct phylogenetic trees from DNA sequences. By performing a run-time analysis of the behavior of the algorithm they came up with an efficient parallel implementation based on dynamic scheduling strategies, speculative run-time execution decisions and I/O buffering. They use I/O buffering (prefetching) to fetch tasks that need to be processed. The parallel code was written in C using PVM for message passing and is available via anonymous ftp at ftp.ac.uma.es.

cfs:lustre:
Lustre: A scalable, high-performance file system. Cluster File Systems Inc. white paper, version 1.0, November 2002. http://www.lustre.org/docs/whitepaper.pdf.

Keywords: object-based storage, distributed file system, parallel file system, pario-bib

Comment: Describes an open-source project to develop an object-based file system for clusters. Related to the NASD project at CMU (http://www.pdl.cs.cmu.edu/NASD/).

cha:subgroup:
Kwangho Cha, Taeyoung Hong, and Jeongwoo Hong. The subgroup method for collective I/O. Lecture Notes in Computer Science, 3320:301-304, December 2004.

Abstract: Because many scientific applications require large data processing, the importance of parallel I/O has been increasingly recognized. For collective I/O, one of the considerable features of parallel I/O, we suggest the subgroup method. It is the way of using collective I/O of MPI effectively in terms of application programs. From the experimental results, we could conclude that the subgroup method for collective I/O is more efficient than plain collective I/O.

Keywords: collective I/O, MPI subgroup, pario-bib
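
Comment: The abstract does not spell out how the subgroups are formed; the sketch below (Python with mpi4py; the subgroup size and file layout are assumptions, not the paper's scheme) only illustrates the general pattern of splitting the world communicator so that each subgroup performs its own smaller collective write.

    # Minimal subgroup collective-I/O sketch (assumes mpi4py and numpy;
    # GROUP_SIZE and the per-group file layout are illustrative).
    from mpi4py import MPI
    import numpy as np

    world = MPI.COMM_WORLD
    rank = world.Get_rank()

    GROUP_SIZE = 4                   # hypothetical subgroup size
    color = rank // GROUP_SIZE       # ranks 0-3 form group 0, 4-7 group 1, ...
    sub = world.Split(color, rank)   # one communicator per subgroup

    buf = np.full(1024, rank, dtype=np.float64)

    # Each subgroup does its own collective write, instead of one global
    # collective operation across all of MPI_COMM_WORLD.
    fh = MPI.File.Open(sub, "out.%d" % color,
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(sub.Get_rank() * buf.nbytes, buf)
    fh.Close()
    sub.Free()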

chandy:array:
John A. Chandy and Prithviraj Banerjee. Reliability evaluation of disk array architectures. In Proceedings of the 1993 International Conference on Parallel Processing, pages I-263-267, St. Charles, IL, 1993. CRC Press.

Keywords: parallel I/O, disk array, pario-bib, RAID

Comment: A framework for evaluating the reliability of RAIDs. They consider failure and repair rates that depend on the workload.

chang:reuse:
Tai-Sheng Chang, Sangyup Shim, and David H. C. Du. The scalability of spatial reuse based serial storage interfaces. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 93-101, San Jose, CA, November 1997. ACM Press.

Abstract: Due to the growing popularity of emerging applications such as digital libraries, Video-On-Demand, distance learning, and the Internet World-Wide Web, multimedia servers with a large capacity and high performance storage subsystem are in high demand. Serial storage interfaces are emerging technologies designed to improve the performance of such storage subsystems. They provide high bandwidth, fault tolerance, fair bandwidth sharing and long distance connection capability. All of these issues are critical in designing a scalable and high performance storage subsystem. Some of the serial storage interfaces provide the spatial reuse feature which allows multiple concurrent transmissions. That is, multiple hosts can access disks concurrently with full link bandwidth if their access paths are disjoint. Spatial reuse provides a way to build a storage subsystem whose aggregate bandwidth may be scaled up with the number of hosts. However, it is not clear how much the performance of a storage subsystem could be improved by the spatial reuse with different configurations and traffic scenarios. Both the limitations and the capabilities of this scalability need to be investigated. To understand their fundamental performance characteristics, we derive an analytic model for the serial storage interfaces with the spatial reuse feature. Based on this model, we investigate the maximum aggregate throughput from different system configurations and load distributions. We show how the number of disks needed to saturate a loop varies with different numbers of hosts and different load scenarios. We also show how load balancing by uniformly distributing the load to all the disks on a loop may incur high overhead. This is because accesses to far-away disks need to go through many links and consume the bandwidth of each link they go through. The results show the achievable throughput may be reduced by more than half in some cases.

Keywords: I/O interface, I/O network, I/O architecture, parallel I/O, pario-bib

chang:titan:
Chialin Chang, Bongki Moon, Anurag Acharya, Carter Shock, Alan Sussman, and Joel Saltz. Titan: a high-performance remote-sensing database. In Proceedings of the Thirteenth International Conference on Data Engineering, Birmingham, U.K., April 1997.

Abstract: There are two major challenges for a high-performance remote-sensing database. First, it must provide low-latency retrieval of very large volumes of spatio-temporal data. This requires effective declustering and placement of a multi-dimensional dataset onto a large disk farm. Second, the order of magnitude reduction in data-size due to post-processing makes it imperative, from a performance perspective, that the postprocessing be done on the machine that holds the data. This requires careful coordination of computation and data retrieval. This paper describes the design, implementation and evaluation of Titan, a parallel shared-nothing database designed for handling remote-sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of AVHRR data from the NOAA-7 satellite. The experimental results show that Titan provides good performance for global queries and interactive response times for local queries.

Keywords: parallel databases, satellite imagery, remote sensing, parallel I/O, pario-bib

chao:datamesh:
Chia Chao, Robert English, David Jacobson, Bart Sears, Alexander Stepanov, and John Wilkes. DataMesh architecture 1.0. Technical Report HPL-92-153, HP Labs, December 1992.
See also earlier version wilkes:datamesh.

Keywords: parallel I/O, parallel file system, pario-bib

Comment: A more detailed spec of the datamesh architecture, specifying components and operations. It is a block server where blocks are associatively addressed by tags. Some search operations are supported, as are atomic tag-changing operations. See also cao:tickertaip, wilkes:datamesh1, wilkes:datamesh, wilkes:houses, wilkes:lessons.

chapple:pario:
S. R. Chapple and R. A. Fletcher. PUL-GF Parallel I/O Concepts. Edinburgh Parallel Computing Center, February 1993. EPCC-KTP-PUL-GF-PROT-CONC 1.0.

Keywords: parallel I/O, pario-bib

Comment: See also bruce:chimp, chapple:pulgf, and chapple:pulgf-adv, for general information on CHIMP and PUL-GF. This document is an exploration of the potential ways to parallelize the underlying I/O support for the PUL-GF interface. They reason about tradeoffs in the number of servers, disks, and clients, but (as they note) without any performance evaluation to back it up. In particular, they argue that there should be one partition per disk, one server per disk, and probably one client to many servers, or many clients to many servers. A key assumption is that a traditional serial file system is the home location for files, and that files are ``converted'' into parallel files (or vice versa) by replicating or distributing them. Applications could choose the number of servers (and hence disks) for each file. Hints could be provided about many things. Interesting idea to allow user hooks for cache prefetch and writeback functions. Support for variable-length records (``atoms'') is a key component. Segments of a file with different formats, e.g., a header and a matrix, may be separated into different components when the file is distributed into parallel form. See chapple:pulpf for info on the eventual realization of these ideas.

chapple:pulgf:
S. R. Chapple and S. M. Trewin. PUL-GF Prototype User Guide. Edinburgh Parallel Computing Center, February 1993. EPCC-KTP-PUL-GF-UG 0.1.

Keywords: parallel I/O, pario-bib

Comment: PUL is a set of libraries that run on top of the CHIMP portable message-passing library (see bruce:chimp). One of the PUL libraries is PUL-GF, to support file I/O. The underlying I/O support is not parallel (but see chapple:pario). The interface is parallel, however; in particular, it supports file modes similar to those used in many systems, which they call single, multi, random, and independent. Formatted and unformatted, synchronous and asynchronous. Very general multidimensional-array read and write functions. Ability to group multiple I/O requests into atomic units, though not a full transaction capability. See also chapple:pulgf-adv and chapple:pario.

chapple:pulgf-adv:
S. R. Chapple. PUL-GF Prototype Advanced User Guide. Edinburgh Parallel Computing Center, January 1993. EPCC-KTP-PUL-GF-PROT-ADV-UG 0.1.

Keywords: parallel I/O, pario-bib

Comment: See chapple:pulgf for a definition of PUL-GF. This document describes the internal client-server interface to PUL-GF, including ways that users can extend the functionality of PUL-GF. In particular, they give an example of how a new file format (a run-length encoded 2-d matrix) can be read and written transparently as if it were a plain matrix file. The extensibility is offered by run-time registration of user-defined interposition functions, to be called at key moments in the processing of a file I/O request. See also bruce:chimp and chapple:pario.

chapple:pulpf:
S. R. Chapple. PUL-PF Reference Manual. Edinburgh Parallel Computing Center, January 1994. EPCC-KTP-PUL-PF-PROT-RM 1.1.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: See also chapple:pulgf and chapple:pario. An evolution of their parallel I/O interface. PUL-PF is a library on top of existing file systems. Every process is either a client or a server; servers write some portion of the file to a file in the file system. Servers can be divided into groups so that files need not be spread across all servers. There seems to be client caching, with consistency controlled differently depending on access mode; when necessary, the application must call get-token and send-token commands to serialize access to an atom. Independently of their single, multi, random, and independent mode, they can read or write the next, previous, current, or ``wild'' atom (wild means the next ``most available'' atom not yet read by this process). Most I/O is on atoms, but particles (pieces of atoms) can also be independently read and written. Hints are supported to specify access pattern (random or sequential, stride), file partitioning, mapping, atom size, or caching. In many of those cases it goes beyond a hint to the supply of a user-defined function, e.g., for cache-replacement algorithm.

chaudhry:relaxing:
Geeta Chaudhry and Thomas H. Cormen. Relaxing the problem-size bound for out-of-core columnsort. Technical Report TR2003-445, Dept. of Computer Science, Dartmouth College, Hanover, NH, April 2003.

Abstract: Previous implementations of out-of-core columnsort limit the problem size to $N \leq \sqrt{(M/P)^3 / 2}$, where $N$ is the number of records to sort, $P$ is the number of processors, and $M$ is the total number of records that the entire system can hold in its memory (so that $M/P$ is the number of records that a single processor can hold in its memory). We implemented two variations to out-of-core columnsort that relax this restriction. Subblock columnsort is based on an algorithmic modification of the underlying columnsort algorithm, and it improves the problem-size bound to $N \leq (M/P)^{5/3} / 4^{2/3}$ but at the cost of additional disk I/O. $M$-columnsort changes the notion of the column size in columnsort.

Keywords: parallel I/O, sorting, out-of-core applications, pario-bib
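
Comment: As a quick arithmetic illustration of the two bounds quoted in the abstract (the per-processor memory size below is a hypothetical parameter, not one of the paper's configurations):

    # Problem-size bounds from the abstract, evaluated for a hypothetical
    # machine where each processor holds M/P = 2**27 records in memory.
    mp = 2**27

    plain    = (mp**3 / 2) ** 0.5       # N <= sqrt((M/P)^3 / 2)
    subblock = mp**(5/3) / 4**(2/3)     # N <= (M/P)^(5/3) / 4^(2/3)

    print(f"plain columnsort:    N <= {plain:.2e} records")
    print(f"subblock columnsort: N <= {subblock:.2e} records")
    # For these parameters the subblock bound is roughly 13x larger.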

chaudhry:thesis:
Geeta Chaudhry. Parallel Out-of-Core Sorting: The Third Way. PhD thesis, Dartmouth College, Hanover, NH, September 2004. Available as Dartmouth Technical Report TR2004-517.

Abstract: Sorting very large datasets is a key subroutine in almost any application that is built on top of a large database. Two ways to sort out-of-core data dominate the literature: merging-based algorithms and partitioning-based algorithms. Within these two paradigms, all the programs that sort out-of-core data on a cluster rely on assumptions about the input distribution. We propose a third way of out-of-core sorting: oblivious algorithms. In all, we have developed six programs that sort out-of-core data on a cluster. The first three programs, based completely on Leighton's columnsort algorithm, have a restriction on the maximum problem size that they can sort. The other three programs relax this restriction; two are based on our original algorithmic extensions to columnsort. We present experimental results to show that our algorithms perform well. To the best of our knowledge, the programs presented in this thesis are the first to sort out-of-core data on a cluster without making any simplifying assumptions about the distribution of the data to be sorted.

Keywords: out-of-core sorting, columnsort, cluster computing, parallel I/O, pario-bib

Comment: Doctoral dissertation. Advisor: Thomas H. Cormen

chaudhry:tricks:
Geeta Chaudhry, Elizabeth A. Hamon, and Thomas H. Cormen. Stupid columnsort tricks. Technical Report TR2003-444, Dept. of Computer Science, Dartmouth College, Hanover, NH, April 2003.

Abstract: Leighton's columnsort algorithm sorts on an $r \times s$ mesh, subject to the restrictions that $s$ is a divisor of $r$ and that $r \geq 2s^2$ (so that the mesh is tall and thin). We show how to mitigate both of these restrictions. One result is that the requirement that $s$ is a divisor of $r$ is unnecessary; columnsort sorts correctly whether or not $s$ divides $r$. We present two algorithms that, as long as $s$ is a perfect square, relax the restriction that $r \geq 2s^2$; both reduce the exponent of $s$ to $3/2$. One algorithm requires $r \geq 4s^{3/2}$ if $s$ divides $r$ and $r \geq 6s^{3/2}$ if $s$ does not divide $r$. The other algorithm requires $r \geq 4s^{3/2}$, and it requires $s$ to be a divisor of $r$. Both algorithms have applications in increasing the maximum problem size in out-of-core sorting programs.

Keywords: parallel I/O, sorting, out-of-core applications, pario-bib

chehadeh:oodb:
Y. C. Chehadeh, A. R. Hurson, L. L. Miller, S. Pakzad, and B. N. Jamoussi. Application of parallel disks for efficient handling of object-oriented databases. In Proceedings of the 1993 IEEE Symposium on Parallel and Distributed Processing, pages 184-191. IEEE Computer Society Press, 1993.

Abstract: In today's workstation based environment, applications such as design databases, multimedia databases, and knowledge bases do not fit well into the relational data processing framework. The object-oriented data model has been proposed to model and process such complex databases. Due to the nature of the supported applications, object-oriented database systems need efficient mechanisms for the retrieval of complex objects and the navigation along the semantic links among objects. Object clustering and buffering have been suggested as efficient mechanisms for the retrieval of complex objects. However, to improve the efficiency of the aforementioned operations, one has to look at the recent advances in storage technology. This paper is an attempt to investigate the feasibility of using parallel disks for object-oriented databases. It analyzes the conceptual changes needed to map the clustering and buffering schemes proposed on the new underlying architecture. The simulation and performance evaluation of the proposed leveled-clustering and mapping schemes utilizing parallel I/O disks are presented and analyzed.

Keywords: parallel I/O, disk array, object oriented database, pario-bib

chen:automatic:
Ying Chen, Marianne Winslett, Y. Cho, and S. Kuo. Automatic parallel I/O performance optimization using genetic algorithms. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, pages 155-162. IEEE Computer Society Press, July 1998.

Abstract: The complexity of parallel I/O systems imposes significant challenge in managing and utilizing the available system resources to meet application performance, portability and usability goals. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. The approach is based on a high level description of the target workload and execution environment characteristics, and applies genetic algorithms to select high quality I/O plans. We have validated this approach in the Panda parallel I/O library. Our performance evaluations on the IBM SP show that this approach can select high quality I/O plans under a variety of system conditions with a low overhead, and the genetic algorithm-selected I/O plans are in general better than the default plans used in Panda.

Keywords: parallel I/O, performance optimization, genetic algorithm, pario-bib

chen:collective:
Ying Chen, Ian Foster, Jarek Nieplocha, and Marianne Winslett. Optimizing collective I/O performance on parallel computers: A multisystem study. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 28-35. ACM Press, July 1997.

Keywords: collective I/O, multiprocessor file system, parallel I/O, pario-bib

chen:eval:
Peter Chen, Garth Gibson, Randy Katz, and David Patterson. An evaluation of redundant arrays of disks using an Amdahl 5890. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 74-85, May 1990.

Keywords: parallel I/O, RAID, disk array, pario-bib

Comment: An experimental validation of the performance predictions of patterson:raid, plus some extensions. Confirms that RAID level 5 (rotated parity) is best for large read/writes, and RAID level 1 (mirroring) is best for small reads/writes.

chen:maxraid:
Peter M. Chen and David A. Patterson. Maximizing performance in a striped disk array. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 322-331, 1990.

Keywords: parallel I/O, RAID, disk striping, pario-bib

Comment: Choosing the optimal striping unit, i.e., size of contiguous data on each disk (bit, byte, block, etc.). A small striping unit is good for low-concurrency workloads since it increases the parallelism applied to each request, but a large striping unit can support high-concurrency workloads where each independent request depends on fewer disks. They do simulations to find throughput, and thus to pick the striping unit. They find equations for the best compromise striping unit based on the concurrency and the disk parameters, or on the disk parameters alone. Some key assumptions may limit applicability, but this is not addressed.
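
Their equations are not reproduced here; the sketch below only illustrates the shape of such a compromise rule, with the striping unit proportional to average positioning time times transfer rate, scaled by concurrency. The scaling constant and disk parameters are placeholders for illustration, not the paper's fitted values.

    # Illustrative striping-unit rule of thumb: small units increase
    # per-request parallelism; large units let concurrent independent
    # requests touch fewer disks each. S is an assumed constant, not the
    # value derived in the paper.
    SECTOR = 512  # bytes

    def striping_unit(avg_position_s, transfer_Bps, concurrency, S=0.25):
        """Bytes of contiguous data to place on each disk."""
        if concurrency <= 1:
            return SECTOR  # one request at a time: spread it across all disks
        return int(SECTOR + S * avg_position_s * transfer_Bps * (concurrency - 1))

    # Example: 16 ms positioning, 2 MB/s transfer (a circa-1990 disk),
    # 8 concurrent requests -> a striping unit of roughly 55 KB.
    print(striping_unit(0.016, 2e6, 8))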

chen:panda:
Y. Chen, M. Winslett, K. E. Seamons, S. Kuo, Y. Cho, and M. Subramaniam. Scalable message passing in Panda. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 109-121, Philadelphia, May 1996. ACM Press.

Abstract: To provide high performance for applications with a wide variety of i/o requirements and to support many different parallel platforms, the design of a parallel i/o system must provide for efficient utilization of available bandwidth both for disk traffic and for message passing. In this paper we discuss the message-passing scalability of the server-directed i/o architecture of Panda, a library for synchronized i/o of multidimensional arrays on parallel platforms. We show how to improve i/o performance in situations where message-passing is a bottleneck, by combining the server-directed i/o strategy for highly efficient use of available disk bandwidth with new mechanisms to minimize internal communication and computation overhead in Panda. We present experimental results that show that with these improvements, Panda will provide high i/o performance for a wider range of applications, such as applications running with slow interconnects, applications performing i/o operations on large numbers of arrays, or applications that require drastic data rearrangements as data are moved between memory and disk (e.g., array transposition). We also argue that in the future, the improved approach to message-passing will allow Panda to support applications that are not closely synchronized or that run in heterogeneous environments.

Keywords: parallel I/O, parallel file system, pario-bib

Comment: see seamons:panda. This paper goes further with some communication improvements.

chen:panda-automatic:
Y. Chen, M. Winslett, Y. Cho, and S. Kuo. Automatic parallel I/O performance optimization in Panda. In Proceedings of the Eleventh Symposium on Parallel Algorithms and Architectures, pages 108-118, 1998.

Abstract: Parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of applications has become a significant challenge. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects optimal I/O plans automatically without human intervention based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses rule-based and randomized search-based algorithms to select optimal parameter settings in Panda, a parallel I/O library for multidimensional arrays. Our performance results obtained from two IBM SPs with significantly different configurations show that the Panda optimizer is able to select high-quality I/O plans and deliver high performance under a variety of system configurations.

Keywords: parallel I/O, Panda, portability, pario-bib

chen:panda-model:
Y. Chen, M. Winslett, S. Kuo, Y. Cho, M. Subramaniam, and K. E. Seamons. Performance modeling for the Panda array I/O library. In Proceedings of Supercomputing '96. ACM Press and IEEE Computer Society Press, November 1996.

Abstract: We present an analytical performance model for Panda, a library for synchronized i/o of large multidimensional arrays on parallel and sequential platforms, and show how the Panda developers use this model to evaluate Panda's parallel i/o performance and guide future Panda development. The model validation shows that system developers can simplify performance analysis, identify potential performance bottlenecks, and study the design trade-offs for Panda on massively parallel platforms more easily than by conducting empirical experiments. More importantly, we show that the outputs of the performance model can be used to help make optimal plans for handling application i/o requests, the first step toward our long-term goal of automatically optimizing i/o request handling in Panda.

Keywords: performance modeling, parallel I/O, pario-bib

Comment: On Web and CDROM only. They derive a detailed but fairly simple model of the Panda 2.0.5 parallel I/O library, by carefully enumerating the costs involved in a collective I/O operation. They measure Panda, AIX, and MPI to obtain parameters, and then they validate the model by comparison with the actual Panda implementation running a basic benchmark and an actual application. The model predicts the benchmark performance very well, and is as much as 20% off on the performance of the application. They have embedded the performance model in a "simulator", which predicts the performance of a given sequence of collective I/O requests, and they plan to use it in future versions of Panda to formulate I/O plans by predicting the performance resulting from several different Panda parameter settings, and choosing the best.
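
The model's actual terms are not reproduced in this bibliography; a generic cost-enumeration sketch in the spirit described above, with every constant and term assumed rather than taken from the paper, might look like:

    # Generic cost enumeration for one collective write: ship each I/O
    # node's share over the network, then write it to disk. All rates
    # and the serialized-phase assumption are illustrative.
    def collective_write_time(bytes_total, n_io_nodes,
                              msg_latency=1e-4,   # s per message (assumed)
                              net_Bps=40e6,       # network bandwidth (assumed)
                              disk_Bps=8e6):      # per-node disk rate (assumed)
        per_node = bytes_total / n_io_nodes
        t_gather = msg_latency + per_node / net_Bps
        t_disk = per_node / disk_Bps
        return t_gather + t_disk

    print(collective_write_time(256 * 2**20, 8))  # 256 MB over 8 I/O nodes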

chen:raid:
Peter Chen, Garth Gibson, Randy Katz, David Patterson, and Martin Schulze. Two papers on RAIDs. Technical Report UCB/CSD 88/479, UC Berkeley, December 1988.
See also later version schulze:raid2.

Keywords: parallel I/O, RAID, disk array, pario-bib

Comment: Basically an updated version of patterson:raid and the prepublished version of gibson:failcorrect.

chen:raid-perf:
S. Chen and D. Towsley. A performance evaluation of RAID architectures. IEEE Transactions on Computers, 45(10):1116-1130, October 1996.

Keywords: parallel I/O, RAID, disk array, pario-bib

chen:raid-survey:
Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Patterson. RAID: high-performance, reliable secondary storage. ACM Computing Surveys, 26(2):145-185, June 1994.

Keywords: RAID, disk array, parallel I/O, survey, pario-bib

Comment: An excellent overview of RAID concepts and technology. It starts from the beginning with a discussion of disk hardware, RAID basics, etc, and then goes on to discuss some of the more advanced features. They also describe a few RAID implementations. Basically, it is a perfect paper to read for folks new to RAID.

chen:raid2:
Peter M. Chen, Edward K. Lee, Ann L. Drapeau, Ken Lutz, Ethan L. Miller, Srinivasan Seshan, Ken Shirriff, David A. Patterson, and Randy H. Katz. Performance and design evaluation of the RAID-II storage server. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 110-120, Newport Beach, CA, 1993.
See also later version drapeau:raid-ii.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: A special back-end box for a Sun4 file server, that hooks a HIPPI network through a crossbar to fast memory, a parity engine, and a bunch of disks on SCSI. They pulled about 20 MB/s through it, basically disk-limited; with more disks they would hit 32-40 MB/s. Much improved over RAID-I, which was limited by the memory bandwidth of the Sun4 server.

chen:raid5stripe:
Peter Chen and Edward K. Lee. Striping in a RAID level 5 disk array. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 136-145, May 1995.

Keywords: disk array, striping, RAID, pario-bib

chen:tuning:
Ying Chen and Marianne Winslett. Automated tuning of parallel I/O systems: An approach to portable I/O performance for scientific applications. IEEE Transactions on Software Engineering, 26(4):362-383, April 2000.

Abstract: Parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability, and usability goals of high performance scientific applications has become a significant challenge. For scientists, the problem is exacerbated by the need to retune the I/O portion of their code for each supercomputer platform where they obtain access. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. The authors present such an approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects high quality I/O plans without human intervention, based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses rule based and randomized search based algorithms to tune parameter settings in Panda, a parallel I/O library for multidimensional arrays. Our performance results obtained from an IBM SP using an out-of-core matrix multiplication application show that the Panda optimizer is able to select high quality I/O plans and deliver high performance under a variety of system configurations with a small total optimization overhead.

Keywords: parallel I/O, pario-bib

chervenak:raid:
Ann L. Chervenak and Randy H. Katz. Performance of a disk array prototype. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 188-197, 1991.

Keywords: parallel I/O, disk array, performance evaluation, RAID, pario-bib

Comment: Measuring the performance of a RAID prototype with a Sun4/280, 28 disks on 7 SCSI strings, using 4 HBA controllers on a VME bus from the Sun. They found lots of bottlenecks that really slowed them down. Under Sprite, the disks were the bottleneck for single disk I/O, single disk B/W, and string I/O. Sprite was a bottleneck for single disk I/O and string I/O. The host memory was a bottleneck for string B/W, HBA B/W, overall I/O, and overall B/W. With a simpler OS, that saved on data copying, they did better, but were still limited by the HBA, SCSI protocol, or the VME bus. Clearly they needed more parallelism in the busses and control system.

chervenak:raid-ii:
Ann L. Chervenak, Ken Shirriff, John H. Hartman, Ethan L. Miller, Srinivasan Seshan, Randy H. Katz, Ken Lutz, David A. Patterson, Edward K. Lee, Peter M. Chen, and Garth A. Gibson. RAID-II: A high-bandwidth network file server. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 26, pages 408-419. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version drapeau:raid-ii.

Keywords: RAID, disk array, network file system, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of drapeau:raid-ii.

chervenak:tertiary:
Ann L. Chervenak. Challenges for tertiary storage in multimedia servers. Parallel Computing, 24(1):157-176, January 1998.

Keywords: parallel I/O, multimedia, tertiary storage, memory hierarchy, pario-bib

Comment: Part of a special issue.

chiang:graph:
Yi-Jen Chiang, Michael T. Goodrich, Edward F. Grove, Roberto Tamassia, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory graph algorithms (extended abstract). In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA '95), pages 139-149, January 1995.

Abstract: We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems. Our results include: (1) Proximate-neighboring: a simple method for deriving external-memory lower bounds via reductions from a problem we call the ``proximate neighbors'' problem; we use this technique to derive non-trivial lower bounds for such problems as list ranking, expression tree evaluation, and connected components. (2) PRAM simulation: methods for efficiently simulating PRAM computations in external memory, even for some cases in which the PRAM algorithm is not work-optimal; we apply this to derive a number of optimal (and simple) external-memory graph algorithms. (3) Time-forward processing: a general technique for evaluating circuits (or ``circuit-like'' computations) in external memory, which we also use in a deterministic list ranking algorithm. (4) Deterministic 3-coloring of a cycle: several optimal methods for 3-coloring a cycle, which can be used as a subroutine for finding large independent sets for list ranking; our ideas go beyond a straightforward PRAM simulation, and may be of independent interest. (5) External depth-first search: a method for performing depth-first search and solving related problems efficiently in external memory; our technique can be used in conjunction with ideas due to Ullman and Yannakakis in order to solve graph problems involving closed semi-ring computations even when their assumption that vertices fit in main memory does not hold.

Our techniques apply to a number of problems, including list ranking, which we discuss in detail, finding Euler tours, expression-tree evaluation, centroid decomposition of a tree, least-common ancestors, minimum spanning tree verification, connected and biconnected components, minimum spanning forest, ear decomposition, topological sorting, reachability, graph drawing, and visibility representation.

Keywords: parallel I/O algorithm, graph algorithm, pario-bib

ching:efficient:
Avery Ching, Alok Choudhary, Wei-keng Liao, Robert Ross, and William Gropp. Efficient structured data access in parallel file systems. In Proceedings of the IEEE International Conference on Cluster Computing, pages 326-335, Hong Kong, China, December 2003. IEEE Computer Society Press.

Abstract: Parallel scientific applications store and retrieve very large, structured datasets. Directly supporting these structured accesses is an important step in providing high-performance I/O solutions for these applications. High-level interfaces such as HDF5 and Parallel netCDF provide convenient APIs for accessing structured datasets, and the MPI-IO interface also supports efficient access to structured data. However, parallel file systems do not traditionally support such access. In this work, we present an implementation of structured data access support in the context of the Parallel Virtual File System (PVFS). We call this support "datatype I/O" because of its similarity to MPI datatypes. This support is built by using a reusable datatype-processing component from the MPICH2 MPI implementation. We describe how this component is leveraged to efficiently process structured data representations resulting from MPI-IO operations. We quantitatively assess the solution using three test applications. We also point to further optimizations in the processing path that could be leveraged for even more efficient operation.

Keywords: I/O interface, high-level libraries, PVFS, structured data representations, pario-bib

Comment: not read, don't have

ching:noncontiguous:
Avery Ching, Alok Choudhary, Kenin Coloma, Wei-keng Liao, Robert Ross, and William Gropp. Noncontiguous I/O accesses through MPI-IO. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 104-111, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: I/O performance remains a weakness of parallel computing systems today. While this weakness is partly attributed to rapid advances in other system components, I/O interfaces available to programmers and the I/O methods supported by file systems have traditionally not matched efficiently with the types of I/O operations that scientific applications perform, particularly noncontiguous accesses. The MPI-IO interface allows for rich descriptions of the I/O patterns desired for scientific applications and implementations such as ROMIO have taken advantage of this ability while remaining limited by underlying file system methods.

A method of noncontiguous data access, list I/O, was recently implemented in the Parallel Virtual File System (PVFS). We implement support for this interface in the ROMIO MPI-IO implementation. Through a suite of non-contiguous I/O tests we compared ROMIO list I/O to current methods of ROMIO noncontiguous access and found that the list I/O interface provides performance benefits in many noncontiguous cases.

Keywords: parallel I/O, MPI-IO, ROMIO, list I/O, noncontiguous access, pario-bib
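
Comment: The list I/O support itself lives inside ROMIO and PVFS; for reference, the sketch below (Python with mpi4py and numpy; the array shape and decomposition are illustrative) shows how an application describes the kind of noncontiguous access that such support accelerates, using a derived datatype and a file view.

    # Sketch: a noncontiguous (column-block) access expressed through
    # MPI-IO. Assumes mpi4py + numpy and that the process count divides
    # 512; ROMIO may service the request with list I/O, data sieving,
    # or collective buffering.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    gsizes = [512, 512]                  # global 2-D array, row-major in file
    lsizes = [512, 512 // nprocs]        # this rank's block of columns
    starts = [0, rank * lsizes[1]]       # strided, noncontiguous in the file

    filetype = MPI.DOUBLE.Create_subarray(gsizes, lsizes, starts,
                                          order=MPI.ORDER_C)
    filetype.Commit()

    fh = MPI.File.Open(comm, "array.dat",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Set_view(0, MPI.DOUBLE, filetype, "native")
    fh.Write_all(np.zeros(lsizes, dtype=np.float64))  # collective write
    fh.Close()
    filetype.Free()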

chiu:smart-disks:
Steve C. Chiu, Wei-keng Liao, Alok N. Choudhary, and Mahmut T. Kandemir. Processor-embedded distributed smart disks for I/O-intensive workloads: architectures, performance models and evaluation. Journal of Parallel and Distributed Computing, 64(3):427-445, March 2004.

Abstract: Processor-embedded disks, or smart disks, with their network interface controller, can in effect be viewed as processing elements with on-disk memory and secondary storage. The data sizes and access patterns of today's large I/O-intensive workloads require architectures whose processing power scales with increased storage capacity. To address this concern, we propose and evaluate disk-based distributed smart storage architectures. Based on analytically derived performance models, our evaluation with representative workloads shows that offloading processing and performing point-to-point data communication improve performance over centralized architectures. Our results also demonstrate that distributed smart disk systems exhibit desirable scalability and can efficiently handle I/O-intensive workloads, such as commercial decision support database (TPC-H) queries, association rules mining, data clustering, and two-dimensional fast Fourier transform, among others.

Keywords: processor-embedded disks, smart disks, analytic performance models, I/O workload, pario-bib

chiueh:tapes:
Tzi-cker Chiueh. Performance optimization for parallel tape arrays. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 375-384, Barcelona, July 1995. ACM Press.

Keywords: parallel I/O, tape striping, pario-bib

Comment: URL points to tech report version. He points out two problems with tape striping: that it is difficult to keep tape drives synchronized due to physical variations and to bad-segment remapping in the tape, and that the start-up cost is very high making it difficult to get multiple tapes loaded and started at the same time. So he proposes a 'triangular interleaving' rather than the traditional round-robin interleaving, coupled with lots of buffering, to deal with these problems. He also proposes to use different striping factors for different files (movies), depending on access characteristics. He includes parameters for some tape robots.

chiung-san:xdac:
Lee Chiung-San, Parng Tai-Ming, Lee Jew-Chin, Tsai Cheng-Nan, and Farn Kwo-Jean. Performance analysis of the XDAC disk array system. In Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing, pages 620-627. IEEE Computer Society Press, 1994.

Abstract: The paper presents an analytical model of a whole disk array architecture, XDAC, which consists of several major subsystems and features: the two-dimensional array structure; IO-bus with split transaction protocol; and cache for processing multiple I/O requests in parallel. Our modelling approach is based on a subsystem access time per request (SATPR) concept, in which we model for each subsystem the mean access time per disk array request. The model is fed with a given set of representative workload parameters and then used to conduct performance analysis for exploring the impact of fork/join synchronization as well as evaluating some architectural design issues of the XDAC system. Moreover, by comparing the SATPRs of subsystems, we can identify the bottleneck for performance improvements.

Keywords: disk array, performance evaluation, analytical model, parallel I/O, pario-bib

cho:fine-grained:
Yong Cho, Marianne Winslett, Ying Chen, and Szu-wen Kuo. Parallel I/O performance of fine grained data distributions. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, pages 163-170. IEEE Computer Society Press, July 1998.

Abstract: Fine grained data distributions are widely used to balance computational loads across compute processes in parallel scientific applications. When a fine grained data distribution is used in memory, performance of I/O intensive applications can be limited not only by disk speed but also by message passing, because a large number of small messages may be generated by the implementation strategy used in the underlying parallel file system or parallel I/O library. Combining (or packetizing) a set of small messages into a large message is generally known to speed up parallel I/O. However, overall I/O performance is affected not only by small messages but also by other factors like cyclic block size and interconnect characteristics. We describe small message combination and communication scheduling for fine grained data distributions in the Panda parallel I/O library and analyze I/O performance on parallel platforms having different interconnects: IBM SP2, IBM workstation cluster connected by FDDI and Pentium II cluster connected by Myrinet.

Keywords: parallel I/O, pario-bib

cho:local:
Yong Cho, Marianne Winslett, Mahesh Subramaniam, Ying Chen, Szu-wen Kuo, and Kent E. Seamons. Exploiting local data in parallel array I/O on a practical network of workstations. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 1-13, San Jose, CA, November 1997. ACM Press.

Abstract: A cost-effective way to run a parallel application is to use existing workstations connected by a local area network such as Ethernet or FDDI. In this paper, we present an approach for parallel I/O of multidimensional arrays on small networks of workstations with a shared-media interconnect, using the Panda I/O library.

In such an environment, the message passing throughput per node is lower than the throughput obtainable from a fast disk and it is not easy for users to determine the configuration which will yield the best I/O performance.

We introduce an I/O strategy that exploits local data to reduce the amount of data that must be shipped across the network, present experimental results, and analyze the results using an analytical performance model and predict the best choice of I/O parameters.

Our experiments show that the new strategy results in a factor of 1.2-2.1 speedup in response time compared to the Panda version originally developed for the IBM SP2, depending on the array sizes, distributions and compute and I/O node meshes. Further, the performance model predicts the results within a 13% margin of error.

Keywords: parallel I/O, distributed system, pario-bib

Comment: They examine a system that supports nodes that are both compute and I/O nodes. The assumption is that the application is writing data to a new file, and does not care to which disks the data goes. They are trying to decide which nodes should be used for I/O, given the distribution of data on compute nodes and the distribution desired across disks. They use a Hungarian algorithm to solve a weighted optimization problem on a bipartite graph connecting I/O nodes to compute nodes, in an attempt to minimize the data flow across the network. But there is no attempt to make a decision that might be sensible for a future read operation that may want to read in a different pattern.
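
The paper's edge weights are not reproduced here, but the assignment step can be sketched with SciPy's implementation of the same Hungarian algorithm, applied to a hypothetical cost matrix of network traffic:

    # Compute-node -> I/O-node assignment as minimum-cost bipartite
    # matching. The cost matrix is hypothetical: cost[c][i] = MB that
    # compute node c must ship over the network if node i does its I/O.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    cost = np.array([            # 4 compute nodes x 4 candidate I/O nodes
        [ 0, 80, 80, 80],        # node 0 already holds its data locally
        [80,  0, 80, 80],
        [80, 80, 40, 80],
        [80, 80, 80, 10],
    ])

    compute, ionode = linear_sum_assignment(cost)  # minimizes total shipping
    for c, i in zip(compute, ionode):
        print(f"compute node {c} -> I/O node {i} ({cost[c, i]} MB shipped)")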

choudhary:jmanagement:
A. Choudhary, M. Kandemir, J. No, G. Memik, X. Shen, W. Liao, H. Nagesh, S. More, V. Taylor, R. Thakur, and R. Stevens. Data management for large-scale scientific computations in high performance distributed systems. Cluster Computing, 3(1):45-60, 2000.
See also earlier version choudhary:management.

Abstract: With the increasing number of scientific applications manipulating huge amounts of data, effective high-level data management is an increasingly important problem. Unfortunately, so far the solutions to the high-level data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file storage systems) or produce unsatisfactory I/O performance in exchange for ease-of-use and portability (as in relational DBMSs). In this paper we present a novel application development environment which is built around an active meta-data management system (MDMS) to handle high-level data in an effective manner. The key components of our three-tiered architecture are user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease-of-use and portability. The high levels of performance are achieved by the MDMS, with the aid of user-specified, performance-oriented directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques for the application at hand to the MDMS. We discuss the importance of an active MDMS and show how the three components of our environment, namely the application, the MDMS, and the HSS, fit together. We also report performance numbers from our ongoing implementation and illustrate that significant improvements are made possible without undue programming effort.

Keywords: cluster computing, scientific computing, parallel I/O, data management, pario-bib

choudhary:management:
A. Choudhary, M. Kandemir, H. Nagesh, J. No, X. Shen, V. Taylor, S. More, and R. Thakur. Data management for large-scale scientific computations in high performance distributed systems. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 263-272, Redondo Beach, CA, August 1999. IEEE Computer Society Press.
See also later version choudhary:jmanagement.

Abstract: With the increasing number of scientific applications manipulating huge amounts of data, effective data management is an increasingly important problem. Unfortunately, so far the solutions to this data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file systems) or produce unsatisfactory I/O performance in exchange for ease-of-use and portability (as in relational DBMSs).

In this paper we present a new environment which is built around an active meta-data management system (MDMS). The key components of our three-tiered architecture are user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease-of-use and portability.

The high levels of performance are achieved by the MDMS, with the aid of user-specified directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques to the MDMS. We discuss the importance of an active MDMS and show how the three components, namely application, the MDMS, and the HSS, fit together. We also report performance numbers from our initial implementation and illustrate that significant improvements are made possible without undue programming effort.

Keywords: cluster computing, scientific computing, parallel I/O, data management, pario-bib

Comment: They argue that existing parallel file systems are too low-level, they have their own set of I/O calls (non-portable), and policies are generally hard-coded into the system. Databases provide a portable layer on top of the file system, but they cannot provide high performance. They propose to "combine the advantages of file systems and databases, while avoiding their respective disadvantages." Their system is composed of a user program, a meta-data management system (MDMS), and a heirarchical storage system (HSS). The user program will query the MDMS to learn where in the HSS their data reside, what the performance of the storage system is, information about the acc data from the storage system, etc...

choudhary:passion:
Alok Choudhary, Rajesh Bordawekar, Michael Harry, Rakesh Krishnaiyer, Ravi Ponnusamy, Tarvinder Singh, and Rajeev Thakur. PASSION: parallel and scalable software for input-output. Technical Report SCCS-636, ECE Dept., NPAC and CASE Center, Syracuse University, September 1994.
See also later version thakur:jpassion.

Keywords: parallel I/O, out-of-core, pario-bib

Comment: This TR overviews the PASSION project, and all its components: two-phase access, out-of-core support for structured and unstructured problems, data sieving, prefetching, caching, compiler and language support, file system support, virtual parallel file system, and parallel pipes. They reference many of their related papers in an extensive bibliography. See also singh:adopt, jadav:ioschedule, thakur:passion, thakur:runtime, bordawekar:efficient, thakur:out-of-core, delrosario:prospects, delrosario:two-phase, bordawekar:primitives, bordawekar:delta-fs.

choudhary:passion-paragon:
Alok Choudhary, Rajesh Bordawekar, Sachin More, K. Sivaram, and Rajeev Thakur. PASSION runtime library for the Intel Paragon. In Proceedings of the Intel Supercomputer User's Group Conference, June 1995.
See also later version thakur:jpassion.

Abstract: We are developing a runtime library which provides a number of routines to perform the I/O required in parallel applications in an efficient and convenient manner. This is part of a project called PASSION, which aims to provide software support for high-performance parallel I/O at the compiler, runtime and file system levels. The PASSION Runtime Library uses a high-level interface which makes it easy for the user to specify the I/O required in the program. The user only needs to specify what portion of the data structure needs to be read from or written to the file, and the PASSION routines will perform all the necessary I/O efficiently. This paper gives an overview of the PASSION Runtime Library and describes in detail its high-level interface.

Keywords: parallel I/O, runtime library, pario-bib

Comment: See also choudhary:passion.

choudhary:sdcr:
Alok Choudhary and David Kotz. Large-scale file systems with the flexibility of databases. ACM Computing Surveys, 28A(4), December 1996. Position paper for the Working Group on Storage I/O for Large-Scale Computing, ACM Workshop on Strategic Directions in Computing Research. Available on-line only, at http://doi.acm.org/10.1145/242224.242488.

Keywords: file system, database, parallel I/O, pario-bib, dfk

Comment: A position paper for the Strategic Directions in Computer Research workshop at MIT in June 1996. See gibson:sdcr and wegner:sdcr.

choudhary:sio-language:
Alok Choudhary, Ian Foster, Geoffrey Fox, Ken Kennedy, Carl Kesselman, Charles Koelbel, Joel Saltz, and Marc Snir. Languages, compilers, and runtime systems support for parallel input-output. Technical Report CCSF-39, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Language extensions to support parallel I/O. Compiler optimizations. Runtime library to support the compiler and interface with the native file system. Compiler would develop a mapping of data to the processor memories and to the disks, and then decide on I/O schedules to move data around, overlap I/O with computation, even move computation around to best fit what is available in memory at a given time. It can also help with checkpointing. Compiler should pass info to the runtime system, which in turn may need to pass info to the file system, to help with optimization. I/O scheduling includes reordering accesses; they even go so far as to propose doing seek optimization in the runtime library. Support for collective I/O. Extension of MPI to I/O, to take advantage of its support for asynchrony, scatter-gather, etc. On the way, they hope to work with the FS people to decide on the functional requirements of the file system. See also poole:sio-survey, bagrodia:sio-character, bershad:sio-os.

chung-sheng:arrays:
Chung-Sheng Li, Ming-Syan Chen, P. S. Yu, and Hui-I Hsiao. Combining replication and parity approaches for fault-tolerant disk arrays. In Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing, pages 360-367. IEEE Computer Society Press, 1994.

Abstract: We explore the method of combining the replication and parity approaches to tolerate multiple disk failures in a disk array. In addition to the conventional mirrored and chained declustering methods, a method based on the hybrid of mirrored-and-chained declustering is explored. A performance study that explores the effect of combining replication and parity approaches is conducted. It is experimentally shown that the proposed approach can lead to the most cost-effective solution if the objective is to sustain the same load as before the failures.

Keywords: fault tolerance, disk array, replication, declustering, parallel I/O, pario-bib

Comment: Consider hybrid chained and mirrored declustering.

clark:molecular:
Terry W. Clark, L. Ridgway Scott, Stanislaw Wlodek, and J. Andrew McCammon. I/O limitations in parallel molecular dynamics. In Proceedings of Supercomputing '95, San Diego, CA, 1995. IEEE Computer Society Press.

Abstract: We discuss data production rates and their impact on the performance of scientific applications using parallel computers. On one hand, too high rates of data production can be overwhelming, exceeding logistical capacities for transfer, storage and analysis. On the other hand, the rate-limiting step in a computationally-based study should be the human-guided analysis, not the calculation. We present performance data for a biomolecular simulation of the enzyme acetylcholinesterase, which uses the parallel molecular dynamics program EulerGROMOS. The actual production rates are compared against a typical time frame for results analysis, where we show that the rate-limiting step is the simulation, and that to overcome this will require improved output rates.

Keywords: parallel I/O application, molecular dynamics, pario-bib

Comment: Note proceedings only on CD-ROM or WWW.

clement:overlap:
Mark J. Clement and Michael J. Quinn. Overlapping computations, communications and I/O in parallel sorting. Journal of Parallel and Distributed Computing, 28(2):162-172, August 1995.

Keywords: parallel I/O algorithm, sorting, pario-bib

Comment: They present a new parallel sorting algorithm that allows overlap between disk, network, and processor. By pipelining the tasks, they can double the speed of sorting; best results, of course, when these three components take approximately equal time. The disk I/O is really only used to load the initial data set and write the output data set, rather than being used for an external sorting scheme. They obtain their gains by overlapping that disk I/O with the communication and processing.
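
As an illustration of the overlap the comment describes, here is a minimal double-buffering sketch (my own, assuming POSIX AIO and a hypothetical process() compute stage; it is not the authors' sorting code): while the program computes on block k, the read of block k+1 proceeds in the background.

    #include <aio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLK (1 << 20)
    static char buf[2][BLK];

    extern void process(char *data, ssize_t n);  /* hypothetical compute stage */

    static void start_read(struct aiocb *cb, int fd, long k)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf    = buf[k & 1];
        cb->aio_nbytes = BLK;
        cb->aio_offset = (off_t)k * BLK;
        aio_read(cb);
    }

    void overlapped_scan(int fd, long nblocks)
    {
        struct aiocb cb;
        long k;
        if (nblocks > 0)
            start_read(&cb, fd, 0);
        for (k = 0; k < nblocks; k++) {
            const struct aiocb *list[1] = { &cb };
            aio_suspend(list, 1, NULL);      /* wait for block k */
            ssize_t n = aio_return(&cb);
            if (k + 1 < nblocks)
                start_read(&cb, fd, k + 1);  /* issue read of k+1 ... */
            process(buf[k & 1], n);          /* ... while computing on k */
        }
    }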

coelho:hpf-io:
Fabien Coelho. Compilation of I/O communications for HPF. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 102-109, 1995.

Keywords: parallel I/O, HPF, compiler, pario-bib

colarelli:allocate:
Dennis Colarelli and Gene Schumacher. New strategy for file allocation on multi-device systems. Supercomputing '93 poster session, 1993.

Keywords: file system, parallel I/O, disk layout, pario-bib

Comment: These two guys from NCAR redid the block allocation strategy routine on the Cray. The current strategy uses round-robin among the disks, using a different disk for each allocation request. Each request looks for blocks on that disk until it is satisfied or space runs out, and then goes to the next disk. It uses a free-block bitmap to find the blocks. Problem: too many extents, not enough contiguity. These guys tried first-fit and best-fit over all extents on all disks. First-fit had faster allocation time, of course, and both had much lower file fragmentation. They also used the vector hardware to search the bitmap for non-zero words.
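
For illustration, a toy first-fit scan over a free-block bitmap (a sketch under the assumption that a set bit marks a free block; not the NCAR code): it returns the start of the first run of `need` consecutive free blocks, or -1 if no such run exists.

    /* find `need` consecutive free blocks (set bits); -1 if none */
    long first_fit(const unsigned char *map, long nblocks, long need)
    {
        long b, run = 0;
        for (b = 0; b < nblocks; b++) {
            if (map[b / 8] & (1u << (b % 8))) {
                if (++run == need)
                    return b - need + 1;
            } else
                run = 0;           /* run of free blocks broken */
        }
        return -1;
    }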

coleman:bottleneck:
Samuel S. Coleman and Richard W. Watson. New architectures to reduce I/O bottlenecks in high-performance systems. In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume I, pages 5-14, 1993.

Keywords: parallel architecture, parallel I/O, pario-bib

Comment: They argue for network-attached devices, and for making I/O devices and networks, instead of CPUs, more the center of architectural design.

coloma:caching:
Kenin Coloma, Alok Choudhary, Wei-keng Liao, Lee Ward, Eric Russell, and Neil Pundit. Scalable high-level caching for parallel I/O. In Proceedings of the International Parallel and Distributed Processing Symposium, page 96b, Santa Fe, NM, April 2004. IEEE Computer Society Press.

Abstract: In order for I/O systems to achieve high performance in a parallel environment, they must either sacrifice client-side file caching, or keep caching and deal with complex coherency issues. The most common technique for dealing with cache coherency in multi-client file caching environments uses file locks to bypass the client-side cache. Aside from effectively disabling cache usage, file locking is sometimes unavailable on larger systems.

The high-level abstraction layer of MPI allows us to tackle cache coherency with additional information and coordination without using file locks. By approaching the cache coherency issue further up, the underlying I/O accesses can be modified in such a way as to ensure access to coherent data while satisfying the user's I/O request. We can effectively exploit the benefits of a file system's client-side cache while minimizing its management costs.

Keywords: client-side file caching, file locking, MPI, pario-bib

colvin:vic:
Alex Colvin and Thomas H. Cormen. ViC*: A compiler for virtual-memory C*. In Proceedings of the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '98), pages 23-33. IEEE Computer Society Press, March 1998.
See also earlier version colvin:vic-tr.

Abstract: This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared outofcore, which describe parallel data stored on disk. The compiler output is a SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data access.

Keywords: compiler, data-parallel programming, programming language, virtual memory, out of core, parallel I/O, pario-bib

colvin:vic-tr:
Alex Colvin and Thomas H. Cormen. ViC*: A compiler for virtual-memory C*. Technical Report PCS-TR97-323, Dept. of Computer Science, Dartmouth College, November 1997.
See also later version colvin:vic.

Abstract: This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared outofcore, which describe parallel data stored on disk. The compiler output is a SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data layout and access.

Keywords: compiler, data-parallel programming, programming language, virtual memory, out of core, parallel I/O, pario-bib

convex:exemplar:
Convex Exemplar Scalable Parallel Processing System. Convex Computer Corporation, 1994. Order number 080-002293-000.

Keywords: parallel computer architecture, shared memory, parallel I/O, pario-bib

Comment: The Convex Exemplar connects hypernodes, which are basically SMP nodes built from 8 HP PA-RISC CPUs, lots of RAM, and a crossbar switch, with their own implementation of the SCI interconnect. Hierarchical caching supports a global shared physical address space. Each hypernode can also have an I/O adapter, to which they can attach lots of different I/O devices. The I/O adapter has the capability to DMA directly into any memory in the system, even on other hypernodes. Each hypernode runs its own file-system server, which manages UNIX file systems on the devices of that hypernode. Striped file systems are supported in software, although it's not clear if they can stripe across hypernodes, or only within hypernodes, i.e., whether (striped) file systems can span multiple hypernodes.

convex:stripe:
CONVEX Computer Corporation, Richardson, Texas. CONVEX UNIX Programmer's Manual, Part I, eighth edition, October 1988.

Keywords: parallel I/O, parallel file system, striping, pario-bib

Comment: Implementation of striped disks on the CONVEX. Uses partitions of normal device drivers. Kernel data structure knows about the interleaving granularity, the set of partitions, sizes, etc.
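
The block-to-disk mapping behind such striping can be sketched as follows (an illustration of round-robin interleaving, not CONVEX's actual code): logical file blocks are grouped into interleave units of `gran` blocks, and units are assigned round-robin across the disks.

    /* logical block b -> (disk, block offset on that disk), with an
       interleave granularity of `gran` blocks across `ndisks` disks */
    void stripe_map(long b, int ndisks, long gran, int *disk, long *off)
    {
        long unit = b / gran;                      /* which interleave unit */
        *disk = (int)(unit % ndisks);              /* disk holding the unit */
        *off  = (unit / ndisks) * gran + b % gran; /* offset on that disk */
    }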

cook:simd-jpeg:
Gregory W. Cook and Edward J. Delp. An investigation of scalable SIMD I/O techniques with application to parallel JPEG compression. Journal of Parallel and Distributed Computing, 30(2):111-128, November 1995.

Keywords: multimedia, compression, data-parallel computing, Maspar, parallel I/O, pario-bib

copeland:bubba:
George Copeland, William Alexander, Ellen Boughter, and Tom Keller. Data placement in Bubba. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 99-108, Chicago, IL, June 1988. ACM Press.

Keywords: parallel I/O, database, disk caching, pario-bib

Comment: A database machine. Experimental/analytical model of a placement algorithm that declusters relations across several parallel, independent disks. The declustering is done on a subset of the disks, and the choices involved are the number of disks to decluster onto, which relations to put where, and whether a relation should be cache-resident. Communications overhead limits the usefulness of declustering in some cases, depending on the workload. See boral:bubba.

corbett:bmpi-overview:
Peter Corbett, Dror Feitelson, Sam Fineberg, Yarsun Hsu, Bill Nitzberg, Jean-Pierre Prost, Marc Snir, Bernard Traversat, and Parkson Wong. Overview of the MPI-IO parallel I/O interface. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 32, pages 477-487. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version corbett:mpi-overview.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: Part of jin:io-book, revised version of corbett:mpi-overview.

corbett:bvesta:
Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 20, pages 285-308. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version corbett:jvesta.

Keywords: multiprocessor file system, Vesta, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of corbett:jvesta.

corbett:jvesta:
Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, August 1996.
See also earlier version corbett:vesta.
See also later version corbett:bvesta.

Abstract: The Vesta parallel file system is designed to provide parallel file access to application programs running on multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of files: a file is not a sequence of bytes, but rather it can be partitioned into multiple disjoint sequences that are accessed in parallel. The partitioning, which can also be changed dynamically, reduces the need for synchronization and coordination during the access. Some control over the layout of data is also provided, so the layout can be matched with the anticipated access patterns. The system is fully implemented and forms the basis for the AIX Parallel I/O File System on the IBM SP2. The implementation does not compromise scalability or parallelism. In fact, all data accesses are done directly to the I/O node that contains the requested data, without any indirection or access to shared metadata. Disk mapping and caching functions are confined to each I/O node, so there is no need to keep data coherent across nodes. Performance measurements show good scalability with increased resources. Moreover, different access patterns are shown to achieve similar performance.

Keywords: multiprocessor file system, Vesta, parallel I/O, pario-bib

Comment: See also corbett:pfs, corbett:vesta*, feitelson:pario. This is the ultimate Vesta reference. There seem to be only a few small things that are completely new over what's been published elsewhere, although this presentation is much more complete and polished.

corbett:mpi-io2:
Peter Corbett, Dror Feitelson, Yarsun Hsu, Jean-Pierre Prost, Marc Snir, Sam Fineberg, Bill Nitzberg, Bernard Traversat, and Parkson Wong. MPI-IO: a parallel file I/O interface for MPI. Technical Report RC 19841 (87784), IBM T.J. Watson Research Center, November 1994. Version 0.2.
See also later version corbett:mpi-io3.

Keywords: parallel I/O, message-passing, multiprocessor file system interface, pario-bib

Comment: Superseded by mpi-ioc:mpi-io5. See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.

corbett:mpi-io3:
Peter Corbett, Dror Feitelson, Yarsun Hsu, Jean-Pierre Prost, Marc Snir, Sam Fineberg, Bill Nitzberg, Bernard Traversat, and Parkson Wong. MPI-IO: a parallel file I/O interface for MPI. Technical Report NAS-95-002, NASA Ames Research Center, January 1995. Version 0.3.
See also earlier version corbett:mpi-io2.
See also later version corbett:mpi-io4.

Keywords: parallel I/O, message-passing, multiprocessor file system interface, pario-bib

Comment: The goal is to design a standard file interface for SPMD message-passing programs. An earlier version of this specification was prost:mpi-io. Superseded by mpi-ioc:mpi-io5. See also the general MPI I/O web page at http://parallel.nas.nasa.gov/MPI-IO/.

corbett:mpi-io4:
Peter Corbett, Yarsun Hsu, Jean-Pierre Prost, Marc Snir, Sam Fineberg, Bill Nitzberg, Bernard Traversat, Parkson Wong, and Dror Feitelson. MPI-IO: a parallel file I/O interface for MPI, December 1995. Version 0.4.
See also earlier version corbett:mpi-io3.
See also later version mpi-ioc:mpi-io5.

Keywords: parallel I/O, message-passing, multiprocessor file system interface, pario-bib

Comment: Superseded by mpi-ioc:mpi-io5. See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.

corbett:mpi-overview:
Peter Corbett, Dror Feitelson, Sam Fineberg, Yarsun Hsu, Bill Nitzberg, Jean-Pierre Prost, Marc Snir, Bernard Traversat, and Parkson Wong. Overview of the MPI-IO parallel I/O interface. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 1-15, April 1995.
See also later version corbett:mpi-overview-book.

Abstract: Thanks to MPI, writing portable message passing parallel programs is almost a reality. One of the remaining problems is file I/O. Although parallel file systems support similar interfaces, the lack of a standard makes developing a truly portable program impossible. It is not feasible to develop large scientific applications from scratch for each generation of parallel machine, and, in the scientific world, a program is not considered truly portable unless it not only compiles, but also runs efficiently.

The MPI-IO interface is being proposed as an extension to the MPI standard to fill this need. MPI-IO supports a high-level interface to describe the partitioning of file data among processes, a collective interface describing complete transfers of global data structures between process memories and files, asynchronous I/O operations, allowing computation to be overlapped with I/O, and optimization of physical file layout on storage devices (disks).

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: A more readable explanation of MPI-IO than the proposed-standard document corbett:mpi-io3. See the polished book version, corbett:mpi-overview-book. See also the slides presented at IOPADS.
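
To make the interface style concrete, here is a minimal collective-write sketch in the MPI-2 form that MPI-IO eventually became (the calls below are standard MPI-2, not the exact functions of this 1995 proposal; the file name is illustrative): each rank sets a file view at its own slice of a shared file and writes with a collective call.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, i;
        double buf[1000];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < 1000; i++)
            buf[i] = rank;                     /* stand-in data */

        MPI_File_open(MPI_COMM_WORLD, "vector.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        /* each rank's view of the file begins at its own slice */
        MPI_File_set_view(fh, (MPI_Offset)rank * 1000 * sizeof(double),
                          MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
        /* collective write: all ranks participate, enabling optimization */
        MPI_File_write_all(fh, buf, 1000, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }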

corbett:mpi-overview-book:
Peter Corbett, Dror Feitelson, Sam Fineberg, Yarsun Hsu, Bill Nitzberg, Jean-Pierre Prost, Marc Snir, Bernard Traversat, and Parkson Wong. Overview of the MPI-IO parallel I/O interface. In Jain et al. [iopads-book], chapter 5, pages 127-146.
See also earlier version corbett:mpi-overview.

Abstract: Thanks to MPI, writing portable message passing parallel programs is almost a reality. One of the remaining problems is file I/O. Although parallel file systems support similar interfaces, the lack of a standard makes developing a truly portable program impossible. It is not feasible to develop large scientific applications from scratch for each generation of parallel machine, and, in the scientific world, a program is not considered truly portable unless it not only compiles, but also runs efficiently.

The MPI-IO interface is being proposed as an extension to the MPI standard to fill this need. MPI-IO supports a high-level interface to describe the partitioning of file data among processes, a collective interface describing complete transfers of global data structures between process memories and files, asynchronous I/O operations, allowing computation to be overlapped with I/O, and optimization of physical file layout on storage devices (disks).

Keywords: parallel I/O, file system interface, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

corbett:pfs:
Peter F. Corbett, Dror G. Feitelson, Jean-Pierre Prost, George S. Almasi, Sandra Johnson Baylor, Anthony S. Bolmarcich, Yarsun Hsu, Julian Satran, Marc Snir, Robert Colao, Brian Herr, Joseph Kavaky, Thomas R. Morgan, and Anthony Zlotek. Parallel file systems for the IBM SP computers. IBM Systems Journal, 34(2):222-248, January 1995.

Abstract: Parallel computer architectures require innovative software solutions to utilize their capabilities. This statement is true for system software no less than for application programs. File system development for the IBM SP product line of computers started with the Vesta research project, which introduced the ideas of parallel access to partitioned files. This technology was then integrated with a conventional Advanced Interactive Executive (AIX) environment to create the IBM AIX Parallel I/O File System product. We describe the design and implementation of Vesta, including user interfaces and enhancements to the control environment needed to run the system. Changes to the basic design that were made as part of the AIX Parallel I/O File System are identified and justified.

Keywords: parallel file system, parallel I/O, Vesta, pario-bib

Comment: Probably the most authoritative Vesta/PIOFS paper yet. Good description of the system, motivations, etc. Not as much detail as some, like corbett:vesta-di.

corbett:rdp:
Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman, James Leong, and Sunitha Sankar. Row-diagonal parity for double disk failure correction. In Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies, pages 1-14, San Francisco, CA, March 2004. Network Appliance, Inc., USENIX Association.

Abstract: Row-Diagonal Parity (RDP) is a new algorithm for protecting against double disk failures. It stores all data unencoded, and uses only exclusive-or operations to compute parity. RDP is provably optimal in computational complexity, both during construction and reconstruction. Like other algorithms, it is optimal in the amount of redundant information stored and accessed. RDP works within a single stripe of blocks of sizes normally used by file systems, databases and disk arrays. It can be utilized in a fixed (RAID-4) or rotated (RAID-5) parity placement style. It is possible to extend the algorithm to encompass multiple RAID-4 or RAID-5 disk arrays in a single RDP disk array. It is possible to add disks to an existing RDP array without recalculating parity or moving data. Implementation results show that RDP performance can be made nearly equal to single parity RAID-4 and RAID-5 performance.

Keywords: fault tolerance, disk failures, algorithms, row-diagonal parity, RAID, pario-bib

Comment: Awarded best paper.
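
A compact sketch of the parity construction as the abstract describes it (the array layout, block type, and function name here are my assumptions): for a prime P, disks 0..P-2 hold data, disk P-1 holds row parity, and disk P holds diagonal parity computed over the data and row-parity disks; a stripe has P-1 rows, the block in row r on disk d lies on diagonal (r+d) mod P, and the parity of diagonal P-1 is deliberately not stored.

    #define P 5  /* any prime > 2; P+1 disks, P-1 rows per stripe */

    /* stripe[r][d]: one byte per block for brevity; disks 0..P-2 are
       data, disk P-1 row parity, disk P diagonal parity */
    void rdp_encode(unsigned char stripe[P - 1][P + 1])
    {
        int r, d, i;
        for (r = 0; r < P - 1; r++) {          /* row parity */
            unsigned char x = 0;
            for (d = 0; d < P - 1; d++)
                x ^= stripe[r][d];
            stripe[r][P - 1] = x;
        }
        for (i = 0; i < P - 1; i++) {          /* diagonal parity */
            unsigned char x = 0;
            for (r = 0; r < P - 1; r++)
                for (d = 0; d < P; d++)        /* data + row parity */
                    if ((r + d) % P == i)
                        x ^= stripe[r][d];
            stripe[i][P] = x;                  /* diagonal P-1 unstored */
        }
    }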

corbett:sio-api1.0:
Peter F. Corbett, Jean-Pierre Prost, Chris Demetriou, Garth Gibson, Erik Riedel, Jim Zelenka, Yuqun Chen, Ed Felten, Kai Li, John Hartman, Larry Peterson, Brian Bershad, Alec Wolman, and Ruth Aydt. Proposal for a common parallel file system programming interface. WWW http://www.cs.arizona.edu/sio/api1.0.ps, September 1996. Version 1.0.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: Specs of the proposed SIO low-level interface for parallel file systems. Key features: linear file model, scatter-gather read and write calls (list of strided segments), asynch versions of all calls, extensive hint system. Naming structure is unspecified; no directories specified. Permissions left out. Some control over client caching and over disk layout. Each file has a (small) 'label', which is just a little space for application-controlled meta data. Optional extensions: collective read and write calls, fast copy.
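
The list-of-strided-segments style might look roughly like this (a hypothetical rendering, not the actual SIO calls: the struct and function names are invented, and the naive pread loop stands in for the batching a real implementation would do).

    #include <sys/types.h>
    #include <unistd.h>

    struct sio_segment {      /* invented for illustration */
        off_t  first;         /* offset of first element */
        off_t  stride;        /* distance between elements */
        size_t size;          /* bytes per element */
        int    count;         /* number of elements */
    };

    /* gather a list of strided segments into buf; short reads and
       other error handling elided for brevity */
    ssize_t read_strided_list(int fd, char *buf,
                              const struct sio_segment seg[], int nseg)
    {
        ssize_t total = 0;
        int s, i;
        for (s = 0; s < nseg; s++)
            for (i = 0; i < seg[s].count; i++) {
                ssize_t n = pread(fd, buf + total, seg[s].size,
                                  seg[s].first + i * seg[s].stride);
                if (n < 0)
                    return -1;
                total += n;
            }
        return total;
    }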

corbett:user-friendly:
Peter F. Corbett, Dror G. Feitelson, Jean-Pierre Prost, and Marc Snir. User-friendly and efficient parallel I/Os using the Vesta parallel file system. In Transputers '94: Advanced Research and Industrial Applications, pages 23-38. IOS Press, September 1994.

Keywords: multiprocessor file system interface, parallel I/O, Vesta, pario-bib

corbett:vesta:
Peter F. Corbett, Sandra Johnson Baylor, and Dror G. Feitelson. Overview of the Vesta parallel file system. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 1-16, Newport Beach, CA, 1993. Also published in Computer Architecture News 21(5), December 1993, pages 7-14.
See also later version corbett:jvesta.

Keywords: parallel I/O, multiprocessor file system, concurrent file checkpointing, multiprocessor file system interface, Vesta, pario-bib

Comment: See corbett:jvesta. Design of a file system for a message-passing MIMD multiprocessor to be used for scientific computing. Separate I/O nodes from compute nodes; I/O nodes and disks are viewed as a data-staging area. File system runs on I/O nodes only. Files declustered by record, among physical partitions, each residing on a separate disk, and each separately growable. Then the user maps logical partitions, one per process, on the file at open time. These are designed to be two-dimensional, so that mapping arrays of various strides and contiguities, with records as the basic unit, is easy. Various consistency and atomicity requirements. File checkpointing, really snapshotting, is built in. No client caching, no redundancy for reliability. See also corbett:vesta2, corbett:vesta3, feitelson:pario.
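
A toy model of the record-oriented mapping (purely hypothetical; Vesta's real scheme is richer, with two-dimensional logical views over separately growable cells): assume records are declustered round-robin over ncells physical partitions, and a process's logical partition takes every stride-th record starting at record first.

    /* k-th record of a logical partition -> (cell, record within cell) */
    void vesta_map(long k, long first, long stride, int ncells,
                   int *cell, long *rec)
    {
        long global = first + k * stride;  /* record's global index */
        *cell = (int)(global % ncells);    /* round-robin declustering */
        *rec  = global / ncells;
    }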

corbett:vesta-di:
Peter F. Corbett and Dror G. Feitelson. Design and implementation of the Vesta parallel file system. In Proceedings of the Scalable High-Performance Computing Conference, pages 63-70, 1994.
See also later version corbett:jvesta.

Abstract: The Vesta parallel file system is designed to provide parallel file access to application programs running on multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of files: a file is not a sequence of bytes, but rather it can be partitioned into multiple disjoint sequences that are accessed in parallel. The partitioning, which can also be changed dynamically, reduces the need for synchronization and coordination during the access. Some control over the layout of data is also provided, so the layout can be matched with the anticipated access patterns. The system is fully implemented, and is beginning to be used by application programmers. The implementation does not compromise scalability or parallelism. In fact, all data accesses are done directly to the I/O node that contains the requested data, without any indirection or access to shared metadata. There are no centralized control points in the system.

Keywords: parallel I/O, multiprocessor file system, file system interface, Vesta, pario-bib

Comment: See corbett:jvesta and corbett:vesta* for other background. Note that since this paper they have put Vesta on top of a raw disk (using 64 KB blocks) rather than on top of AIX-JFS. They describe here the structure of Vesta (2-d files, cells, subfiles, etc), the ordering of bytes within a subfile, hashing of the file name to find the file metadata, Xrefs instead of directories, caching, asynchronous I/O, prefetching, shared file pointers, concurrency control, and block-list structure. Many things, some visible to the user and some not, are new.

corbett:vesta-man:
Peter F. Corbett and Dror G. Feitelson. Vesta file system programmer's reference. Research Report RC 19898 (88058), IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, October 1994. Version 1.01.

Keywords: multiprocessor file system, parallel I/O, Vesta, pario-bib

Comment: Complete user's manual of the Vesta file system. Impressive in its completeness (e.g., it has user quotas). Handy for its detailed description of the interface, but doesn't say much (of course) about the implementation.

corbett:vesta2:
Peter F. Corbett, Dror G. Feitelson, Jean-Pierre Prost, and Sandra Johnson Baylor. Parallel access to files in the Vesta file system. In Proceedings of Supercomputing '93, pages 472-481, Portland, OR, 1993. IEEE Computer Society Press.
See also later version corbett:jvesta.

Keywords: multiprocessor file system, file checkpointing, Vesta, pario-bib

Comment: See also corbett:jvesta, corbett:vesta, corbett:vesta3, feitelson:pario. A new abstraction and a new interface. Typical systems use transparent striping, and access modes. They believe that "optimization requires control". Need to be able to tell the system what you want. User-defined or default. Asynch I/O. Concurrency control. Checkpointing. Export/import to external storage. New abstraction: file is multiple sequences of records. Each process sees a logical partition of the file. Physical partition is one or more disks. Logical partition defined in terms of records. Can repartition without moving data. Rectilinear decompositions of file data to processors. They can do gather/scatter requests. Using logical partitions gives the system the knowledge that the user's accesses are disjoint. Collective operations with consistency checks, vs. independent access. Collective open defines logical view, then synch, then check that partitions are disjoint. If not, then they have access modes to define semantics (more or less the same as other systems). Consider this a target for HPF, etc. Physical partitioning (record size and number of partitions) is defined at create time. Can they have different physical or logical partition sizes in the same file? Future: parallel pipelines, "out-of-core" backing store for HPF arrays, high-level operations, collective operations.

cormen:bmmc:
Thomas H. Cormen and Leonard F. Wisniewski. Asymptotically tight bounds for performing BMMC permutations on parallel disk systems. In Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures, pages 130-139. ACM Press, June 1993.
See also later version cormen:bmmc-tr.

Keywords: parallel I/O, algorithm, pario-bib

Comment: Earlier version available as Dartmouth tech report PCS-TR93-193. But the most recent and complete version is Dartmouth PCS-TR94-223, cormen:bmmc-tr.

cormen:bmmc-tr:
Thomas H. Cormen, Thomas Sundquist, and Leonard F. Wisniewski. Asymptotically tight bounds for performing BMMC permutations on parallel disk systems. Technical Report PCS-TR94-223, Dept. of Computer Science, Dartmouth College, July 1994. Preliminary version also appeared in Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures.
See also earlier version cormen:bmmc.

Keywords: parallel I/O algorithms, pario-bib

Comment: Supersedes cormen:bmmc.

cormen:early-vic:
Thomas H. Cormen and Melissa Hirschl. Early experiences in evaluating the parallel disk model with the ViC* implementation. Parallel Computing, 23(4):571-600, June 1997.
See also earlier version cormen:early-vic-tr.

Abstract: Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total running time to perform an out-of-core computation. This paper analyzes timing results on multiple-disk platforms for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM.

The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, one (problem size) is a good indicator of I/O time and running time, one (memory size) is a good indicator of I/O time but not necessarily running time, and the other two (block size and number of disks) do not necessarily indicate either I/O or running time. Third, because PDM algorithms tend not to be I/O bound, using asynchronous I/O can reduce I/O wait times significantly.

The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several underlying file systems and target machines.

Keywords: parallel I/O, parallel I/O algorithm, compiler, pario-bib

cormen:early-vic-tr:
Thomas H. Cormen and Melissa Hirschl. Early experiences in evaluating the parallel disk model with the ViC* implementation. Technical Report PCS-TR96-293, Dept. of Computer Science, Dartmouth College, August 1996.
See also later version cormen:early-vic.

Abstract: Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total time to perform an out-of-core computation. This paper analyzes timing results on a uniprocessor with several disks for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM.

The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, two (problem size and memory size) are good indicators of I/O time and running time, but the other two (block size and number of disks) are not. Third, because PDM algorithms tend not to be I/O bound, asynchronous I/O effectively hides I/O times.

The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several parallel file systems and target machines.

Keywords: parallel I/O, parallel I/O algorithm, compiler, pario-bib

Comment: This used to be called cormen:early-vic but I renamed it because the paper will appear in parcomp.

cormen:fft:
Thomas H. Cormen and David M. Nicol. Performing out-of-core FFTs on parallel disk systems. Parallel Computing, 24(1):5-20, January 1998.
See also earlier version cormen:fft-tr.

Abstract: The fast Fourier transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the parallel disk model (PDM) of J.S. Vitter and E.A.M. Shriver (1994). When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive, or better than, those for in-core methods even when they run entirely in memory.

Keywords: parallel I/O, out of core, scientific computing, FFT, pario-bib

Comment: See also cormen:fft2 and cormen:fft3. Part of a special issue.

cormen:fft-tr:
Thomas H. Cormen and David M. Nicol. Performing out-of-core FFTs on parallel disk systems. Technical Report PCS-TR96-294, Dept. of Computer Science, Dartmouth College, 1996.
See also later version cormen:fft.

Abstract: The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive, or better than, those for in-core methods even when they run entirely in memory.

Keywords: parallel I/O, out of core, scientific computing, FFT, pario-bib

cormen:fft2-tr:
Thomas H. Cormen, Jake Wegmann, and David M. Nicol. Multiprocessor out-of-core FFTs with distributed memory and parallel disks. Technical Report PCS-TR97-303, Dept. of Computer Science, Dartmouth College, 1997.
See also later version cormen:fft3.

Abstract: This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86% of the time of those that do. Moreover, the faster methods are much easier to implement.

Keywords: parallel I/O, out of core, scientific computing, FFT, pario-bib

Comment: Extends the work of cormen:fft.

cormen:fft3:
Thomas H. Cormen, Jake Wegmann, and David M. Nicol. Multiprocessor out-of-core FFTs with distributed memory and parallel disks. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 68-78, San Jose, CA, November 1997. ACM Press.

Abstract: This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations; communication that ordinarily occurs in a butterfly is folded into other data-movement operations. An analysis program shows that the two methods that use no butterfly communication usually use less communication overall than the other methods. The analysis program is fast enough that it can be invoked at run time to determine which of the four methods uses the least communication. One set of performance results on a small workstation cluster indicates that the methods without butterfly communication are approximately 9.5% faster. Moreover, they are much easier to implement.

Keywords: out of core, parallel I/O, pario-bib

Comment: They find a way to move the interprocessor communication involved in the out-of-core FFT into a single BMMC permutation between "super-levels", where each super-level involves log(M) stages of the FFT. This usually leads to less communication and to better overall performance. See also cormen:fft and cormen:fft2.

cormen:fg:
Thomas H. Cormen and Elena R. Davidson. FG: a framework generator for hiding latency in parallel programs running on clusters. In D. A. Bader and A. A. Khokhar, editors, Proceedings of the 17th IASTED International Conference on Parallel and Distributed Computing and Systems, pages 137-144, San Francisco, CA, September 2004. International Society for Computers and Their Applications (ISCA).

Abstract: FG is a programming environment for asynchronous programs that run on clusters and fit into a pipeline framework. It enables the programmer to write a series of synchronous functions and represents them as stages of an asynchronous pipeline. FG mitigates the high latency inherent in interprocessor communication and accessing the outer levels of the memory hierarchy. It overlaps separate pipeline stages that perform communication, computation, and I/O by running the stages asynchronously. Each stage maps to a thread. Buffers, whose sizes correspond to block sizes in the memory hierarchy, traverse the pipeline. FG makes such pipeline-structured parallel programs easier to write, smaller, and faster. FG offers several advantages over statically scheduled overlapping and dynamically scheduled overlapping via explicit calls to thread functions. First, it reduces coding and debugging time. Second, we find that it reduces code size by approximately 15-26%. Third, according to experimental results, it improves performance. Compared with programs that use static scheduling, FG-generated programs run approximately 61-69% faster on a 16-node Beowulf cluster. Compared with programs that make explicit calls for dynamically scheduled threads, FG-generated programs run slightly faster. Fourth, FG offers various design options and makes it easy for the programmer to explore different pipeline configurations.

Keywords: asynchronous I/O, pipelined I/O, pario-bib

cormen:integrate:
Thomas H. Cormen and David Kotz. Integrating theory and practice in parallel file systems. In Proceedings of the 1993 DAGS/PC Symposium, pages 64-74, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies. Revised as Dartmouth PCS-TR93-188 on 9/20/94.
See also later version cormen:integrate-tr.

Abstract: Several algorithms for parallel disk systems have appeared in the literature recently, and they are asymptotically optimal in terms of the number of disk accesses. Scalable systems with parallel disks must be able to run these algorithms. We present for the first time a list of capabilities that must be provided by the system to support these optimal algorithms: control over declustering, querying about the configuration, independent I/O, and turning off parity, file caching, and prefetching. We summarize recent theoretical and empirical work that justifies the need for these capabilities. In addition, we sketch an organization for a parallel file interface with low-level primitives and higher-level operations.

Keywords: parallel I/O, multiprocessor file systems, algorithm, file system interface, dfk, pario-bib

Comment: Describing the file system capabilities needed by parallel I/O algorithms to effectively use a parallel disk system. Revised as Dartmouth PCS-TR93-188 (updated).

cormen:integrate-tr:
Thomas H. Cormen and David Kotz. Integrating theory and practice in parallel file systems. Technical Report PCS-TR93-188, Dept. of Math and Computer Science, Dartmouth College, March 1993. Revised 9/20/94.
See also earlier version cormen:integrate.

Abstract: Several algorithms for parallel disk systems have appeared in the literature recently, and they are asymptotically optimal in terms of the number of disk accesses. Scalable systems with parallel disks must be able to run these algorithms. We present a list of capabilities that must be provided by the system to support these optimal algorithms: control over declustering, querying about the configuration, independent I/O, turning off file caching and prefetching, and bypassing parity. We summarize recent theoretical and empirical work that justifies the need for these capabilities.

Keywords: parallel I/O, multiprocessor file systems, algorithm, file system interface, dfk, pario-bib

Comment: Describing the file system capabilities needed by parallel I/O algorithms to effectively use a parallel disk system. Cite cormen:integrate.

cormen:jbmmc:
T. H. Cormen, T. Sundquist, and L. F. Wisniewski. Asymptotically tight bounds for performing BMMC permutations on parallel disk systems. SIAM Journal on Computing, 28(1):105-136, 1998.

Abstract: This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by J.S. Vitter and E.A.M. Shriver (1994). A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data. The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations. Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein

Keywords: parallel I/O, parallel I/O algorithms, pario-bib
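
The address mapping itself is easy to state in code. A sketch (my own, packing each row of the characteristic matrix into a 64-bit word): target bit i is the GF(2) inner product of matrix row i with the source address, XORed with complement bit i.

    #include <stdint.h>

    /* GF(2) parity of a 64-bit word */
    static int parity64(uint64_t x)
    {
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        return (int)(x & 1);
    }

    /* apply target = A*src + c over GF(2); a[i] is row i of the
       characteristic matrix A packed into a word, c the complement
       vector, nbits the address width */
    uint64_t bmmc_target(const uint64_t a[], uint64_t c,
                         uint64_t src, int nbits)
    {
        uint64_t dst = 0;
        int i;
        for (i = 0; i < nbits; i++)
            dst |= (uint64_t)(parity64(a[i] & src) ^ ((c >> i) & 1)) << i;
        return dst;
    }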

cormen:oocfft:
Thomas H. Cormen and David M. Nicol. Out-of-core FFTs with parallel disks. ACM SIGMETRICS Performance Evaluation Review, 25(3):3-12, December 1997.

Keywords: scientific computing, out-of-core computation, parallel I/O, pario-bib

Comment: Part of a special issue on parallel and distributed I/O.

cormen:permute:
Thomas H. Cormen. Fast permuting on disk arrays. Journal of Parallel and Distributed Computing, 17(1-2):41-57, January and February 1993.

Keywords: parallel I/O algorithm, pario-bib

Comment: See also cormen:thesis.

cormen:thesis:
Thomas H. Cormen. Virtual Memory for Data-Parallel Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1992.

Keywords: parallel I/O, algorithm, pario-bib

Comment: Lots of algorithms for out-of-core permutation problems. See also cormen:permute, cormen:integrate.

cormen:vic:
Thomas H. Cormen and Alex Colvin. ViC*: A preprocessor for virtual-memory C*. Technical Report PCS-TR94-243, Dept. of Computer Science, Dartmouth College, November 1994.

Abstract: This paper describes the functionality of ViC*, a compiler-like preprocessor for out-of-core C*. The input to ViC* is a C* program but with certain shapes declared outofcore, which means that all parallel variables of these shapes reside on disk. The output is a standard C* program with the appropriate I/O and library calls added for efficient access to out-of-core parallel variables.

Keywords: compiler, out-of-core computation, parallel I/O, pario-bib

correa:out-of-core:
Wagner T. Corrêa, James T. Klosowski, and Cláudio T. Silva. Out-of-core sort-first parallel rendering for cluster-based tiled displays. Parallel Computing, 29(3):325-338, March 2003.

Abstract: We present a sort-first parallel system for out-of-core rendering of large models on cluster-based tiled displays. The system renders high-resolution images of large models at interactive frame rates using off-the-shelf PCs with small memory. Given a model, we use an out-of-core preprocessing algorithm to build an on-disk hierarchical representation for the model. At run time, each PC renders the image for a display tile, using an out-of-core rendering approach that employs multiple threads to overlap rendering, visibility computation, and disk operations. The system can operate in approximate mode for real-time rendering, or in conservative mode for rendering with guaranteed accuracy. Running our system in approximate mode on a cluster of 16 PCs each with 512 MB of main memory, we are able to render 12-megapixel images of a 13-million-triangle model with 99.3% of accuracy at 10.8 frames per second. Rendering such a large model at high resolutions and interactive frame rates would typically require expensive high-end graphics hardware. Our results show that a cluster of inexpensive PCs is an attractive alternative to those high-end systems.

Keywords: cluster based tiled displays, out-of-core rendering, sort first parallel rendering, pario-bib

cortes:bcooperative:
Toni Cortes, Sergi Girona, and Jesús Labarta. Design issues of a cooperative cache with no coherence problems. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 18, pages 259-270. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version cortes:cooperative.

Keywords: cooperative caching, distributed file system, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of cortes:cooperative.

cortes:bookchap:
Toni Cortes. Software RAID and parallel filesystems. In Rajkumar Buyya, editor, High Performance Cluster Computing, pages 463-496. Prentice Hall PTR, 1999.

Keywords: parallel file system, RAID, cluster computing, parallel I/O, pario-bib

cortes:cooperative:
Toni Cortes, Sergi Girona, and Jesús Labarta. Design issues of a cooperative cache with no coherence problems. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 37-46, San Jose, CA, November 1997. ACM Press.
See also later version cortes:bcooperative.

Abstract: In this paper, we examine some of the important problems observed in the design of cooperative caches. Solutions to the coherence, load-balancing and fault-tolerance problems are presented. These solutions have been implemented as a part of PAFS, a parallel/distributed file system, and its performance has been compared to the one achieved by xFS. Using the comparison results, we have observed that the proposed ideas not only solve the main problems of cooperative caches, but also increase the overall system performance. Although the solutions presented in this paper were targeted to a parallel machine, reasonably good results have also been obtained for networks of workstations.

Keywords: cooperative caching, distributed file system, parallel I/O, pario-bib

Comment: They make the claim that it is better not to replicate data into local client caches; rather, it is better to simply make remote read and write requests to the cached block in whatever memory it may be. That reduces the overhead (space and time) of replication and coherency, and leads to better performance. They also present a range of parity-based fault-tolerance mechanisms, and a load-balancing technique that reassigns cache buffers to cache-manager processes.

cortes:hetero2:
Toni Cortes and Jesús Labarta. Taking advantage of heterogeneity in disk arrays. Journal of Parallel and Distributed Computing, 63(4):448-464, April 2003.

Abstract: Disk arrays, or RAIDs, have become the solution to increase the capacity and bandwidth of most storage systems, but their usage has some limitations because all the disks in the array have to be equal. Nowadays, assuming a homogeneous set of disks to build an array is becoming an unrealistic assumption in many environments, especially in low-cost clusters of workstations. It is difficult to find a disk with the same characteristics as the ones in the array, and replacing or adding new disks breaks the homogeneity. In this paper, we propose two block-distribution algorithms (one for RAID0 and an extension for RAID5) that can be used to build disk arrays from a heterogeneous set of disks. We also show that arrays using this algorithm are able to serve many more disk requests per second than when blocks are distributed assuming that all disks have the lowest common speed, which is the solution currently being used.

Keywords: AdaptRaid, block distribution, heterogeneity, RAID, parallel I/O, pario-bib
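
The flavor of such a distribution can be sketched with a weighted round-robin (an illustration only, not the paper's AdaptRaid algorithm): disk i with integer weight w[i] receives w[i] of every sum(w) consecutive blocks, so a faster disk can be given a proportionally larger weight.

    /* blk -> disk index; disk i gets w[i] of every sum(w) blocks */
    int place_block(long blk, const int w[], int ndisks)
    {
        long total = 0, r;
        int i;
        for (i = 0; i < ndisks; i++)
            total += w[i];
        r = blk % total;             /* position within the pattern */
        for (i = 0; i < ndisks; i++) {
            if (r < w[i])
                return i;
            r -= w[i];
        }
        return ndisks - 1;           /* not reached */
    }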

cortes:heterogeneity:
T. Cortes and J. Labarta. Extending heterogeneity to RAID level 5. In Proceedings of the 2001 USENIX Technical Conference, pages 119-132, Boston, June 2001. USENIX Association.

Abstract: RAID level 5 arrays are among the most widely used kinds of disk array, but their usage has some limitations because all the disks in the array have to be equal. Nowadays, assuming a homogeneous set of disks to build an array is becoming an unrealistic assumption in many environments, especially in low-cost clusters of workstations. It is difficult to find a disk with the same characteristics as the ones in the array, and replacing or adding new disks breaks the homogeneity. In this paper, we propose a block-distribution algorithm that can be used to build disk arrays from a heterogeneous set of disks. We also show that arrays using this algorithm are able to serve many more disk requests per second than when blocks are distributed assuming that all disks have the lowest common speed, which is the solution currently being used.

Keywords: parallel I/O, RAID, pario-bib

Comment: The web page for the project is http://people.ac.upc.es/toni/AdaptRaid.html

cortes:heterogeneous:
T. Cortes and J. Labarta. A case for heterogeneous disk arrays. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster'2000), pages 319-325. IEEE Computer Society Press, November 2000.

Abstract: Heterogeneous disk arrays are becoming a common configuration in many sites and specially in storage area networks (SAN). As new disks have different characteristics than old ones, adding new disks or replacing old ones ends up in a heterogeneous disk array. Current solutions to this kind of arrays do not take advantage of the improved characteristics of the new disks. In this paper, we present a block-distribution algorithm that takes advantage of these new characteristics and thus improves the performance and capacity of heterogeneous disk arrays compared to current solutions.

Keywords: disk array, parallel I/O, pario-bib

Comment: The technical report associated with this paper can be found at ftp://ftp.ac.upc.es/pub/reports/DAC/2000/UPC-DAC-2000-76.ps.Z

cortes:hraid:
Toni Cortes and Jesús Labarta. HRaid: A flexible storage-system simulator. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pages 772-778. CSREA Press, June 1999.

Abstract: Clusters of workstations are becoming a quite popular platform to run high-performance applications. This fact has stressed the need for high-performance storage systems for this kind of environment. In order to design such systems, we need adequate tools, which should be flexible enough to model a cluster of workstations. Currently available simulators do not allow heterogeneity (several kinds of disks), hierarchies or resource sharing (among others), which are quite common in clusters. To fill this gap, we have designed and implemented HRaid, which is a very flexible and easy-to-use storage-system simulator. In this paper, we present this simulator, its main abstractions and some simple examples of how it can be used.

Keywords: simulation, RAID, disk array, storage system, heterogeneous system, parallel I/O, pario-bib

cortes:lessons:
Toni Cortes. Parallel I/O: lessons learnt in the last 20 years. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, San Diego, CA, September 2004. IEEE.

Abstract: Summary form only given. After these two decades, it is a good time to review the work done and try to learn the important lessons that all these parallel I/O initiatives have taught us. This paper aims at giving this global overview. The focus is not on commercial/academic systems/prototypes, but on the concepts that lie behind them. These concepts have normally been applied at different levels, and thus such an overview can be of interest to many people, ranging from hardware design to application implementation. Some of the most important concepts discussed are, among others, data placement (RAIDs, 2D and 3D files, ...), network architectures for parallel I/O (network-attached devices, SAN, ...), parallel caching and prefetching (cooperative caching, informed caching and prefetching, ...), and interfaces (collective I/O, data distribution interfaces, ...).

Keywords: tutorial, parallel I/O overview, pario-bib

Comment: Tutorial given at Cluster 2004.

cortes:paca:
Toni Cortes, Sergi Girona, and Jesús Labarta. PACA: A cooperative file system cache for parallel machines. In Proceedings of the 2nd International Euro-Par Conference, pages I:477-486, August 1996.
See also earlier version cortes:paca-tr.

Abstract: A new cooperative caching mechanism, PACA, along with a caching algorithm, LRU-Interleaved, and an aggressive prefetching algorithm, Full-File-On-Open, are presented. The caching algorithm is especially targeted at parallel machines running a microkernel-based operating system. It avoids the cache-coherence problem with no loss in performance. In the above environment, LRU-Interleaved obtains better results than N-Chance Forwarding. We also evaluate an aggressive prefetching algorithm that greatly increases read performance by taking advantage of the huge caches cooperative caching offers.

Keywords: file caching, multiprocessor file system, cooperative caching, parallel I/O, pario-bib

Comment: Contact toni@ac.upc.es. See also a longer version of the paper, cortes:paca-tr.

cortes:paca-tr:
Toni Cortes, Sergi Girona, and Jesús Labarta. PACA: A cooperative file system cache for parallel machines. Technical Report 96-07, UPC-CEPBA, 1996.
See also later version cortes:paca.

Keywords: file caching, multiprocessor file system, cooperative caching, parallel I/O, pario-bib

Comment: See cortes:paca.

cortes:pafs:
Toni Cortes, Sergi Girona, and Jesús Labarta. Avoiding the cache-coherence problem in a parallel/distributed file system. In Proceedings of High-Performance Computing and Networking, pages 860-869, April 1997.
See also later version cortes:pafs2.

Abstract: In this paper we describe PAFS, a new parallel/distributed file system. Within the whole file system, special interest is placed on the caching mechanism. We present a cooperative cache that has the advantages of cooperation and avoids the problems derived from coherence mechanisms. Furthermore, this has been achieved with a reasonable gain in performance. To show the performance obtained, we present a comparison between PAFS and xFS (a file system that also implements a cooperative cache).

Keywords: file caching, multiprocessor file system, cooperative caching, cache coherence, parallel I/O, pario-bib

Comment: Contact toni@ac.upc.es.

cortes:pafs2:
Toni Cortes, Sergi Girona, and Jesús Labarta. Avoiding the cache-coherence problem in a parallel/distributed file system. Technical Report UPC-CEPBA-1996-13, UPC-CEPBA, May 1997.
See also earlier version cortes:pafs.

Abstract: In this paper we present PAFS, a new parallel/distributed file system. Within the whole file system, special interest is placed on the caching and prefetching mechanisms. We present a cooperative cache that avoids the coherence problem while it continues to be highly scalable and achieves very good performance. We also present an aggressive prefetching algorithm that allows full utilization of the big caches offered by the cooperative cache mechanism. All the results presented in this paper have been obtained through simulation using the Sprite workload.

Keywords: file caching, multiprocessor file system, cooperative caching, cache coherence, parallel I/O, pario-bib

Comment: A longer, more detailed version of cortes:pafs.

cortes:prefetch:
T. Cortes and J. Labarta. Linear aggressive prefetching: A way to increase the performance of cooperative caches. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, pages 45-54, San Juan, Puerto Rico, April 1999.

Abstract: Cooperative caches offer huge amounts of caching memory that is not always used as well as it could be. We might find blocks in the cache that have not been requested for many hours. These blocks will hardly improve the performance of the system, while the buffers they occupy could be better used to speed up I/O operations. In this paper, we present a family of simple prefetching algorithms that increase file-system performance significantly. Furthermore, we also present a way to turn any simple prefetching algorithm into an aggressive one that controls its aggressiveness so as not to flood the cache unnecessarily. All these algorithms and mechanisms have been shown to increase the performance of two state-of-the-art parallel/distributed file systems: PAFS and xFS.

Keywords: parallel I/O, file access pattern, prefetching, caching, simulation, pario-bib

Comment: They present algorithms for ``linear aggressive prefetching'' for systems using a cooperative cache. Two prediction schemes are used: OBA (one block ahead) and IS_PPM (interval and size prediction by partial match). The aggressive prefetching algorithm continuously prefetches data until a misprediction occurs. When a misprediction occurs, they realize that they were on the wrong path and start prefetching again from the mispredicted block. To limit the aggressiveness of the prefetching, they allow only one block from each file to be prefetched at a time. If a single application is running, this forces parallel reads to utilize only one disk at a time. They claim, however, that when many files are being accessed they achieve good disk utilization. They implemented the prefetching algorithms on the xFS (anderson:serverless) and PAFS (cortes:pafs) file systems. They used a trace-driven simulator, DIMEMAS (labarta:dip), to obtain performance results for portions of the CHARISMA and Sprite workloads. The results show that aggressive prefetching does not usually load the system more than a system with no prefetching, and sometimes it even lowers the disk traffic.
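
The scheme lends itself to a small sketch. The following is a minimal, single-threaded illustration (not the authors' code) of one-block-ahead prefetching with the one-outstanding-block-per-file limit described above; fetch_block is a hypothetical disk-read function, and a real implementation would issue the prefetch asynchronously and bound the cache size.

    class OneBlockAhead:
        """Serve a block, then prefetch the next one. A misprediction simply
        restarts the prefetch stream at whatever block the client asked for;
        at most one prefetched block per file is outstanding at a time."""

        def __init__(self, fetch_block):
            self.fetch = fetch_block   # hypothetical: reads one block from disk
            self.cache = {}            # (file, block) -> data

        def read(self, f, block):
            if (f, block) not in self.cache:
                self.cache[(f, block)] = self.fetch(f, block)          # demand miss
            if (f, block + 1) not in self.cache:
                self.cache[(f, block + 1)] = self.fetch(f, block + 1)  # prefetch ahead
            return self.cache[(f, block)]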

cortes:thesis:
Toni Cortes. Cooperative Caching and Prefetching in Parallel/Distributed File Systems. PhD thesis, UPC: Universitat Politècnica de Catalunya, Barcelona, Spain, 1997.

Keywords: parallel I/O, file access pattern, prefetching, caching, pario-bib

courtright:backward:
William V. Courtright II and Garth A. Gibson. Backward error recovery in redundant disk arrays. In Proceedings of the Twentieth International Conference for the Resource Management and Performance Evaluation of Enterprise Computing Systems (CMG), pages 63-74, December 1994.
See also earlier version courtright:backward-tr.

Abstract: Redundant disk arrays are single fault tolerant, incorporating a layer of error handling not found in nonredundant disk systems. Recovery from these errors is complex, due in part to the large number of erroneous states the system may reach. The established approach to error recovery in disk systems is to transition directly from an erroneous state to completion. This technique, known as forward error recovery, relies upon the context in which an error occurs to determine the steps required to reach completion, which implies forward error recovery is design specific. Forward error recovery requires the enumeration of all erroneous states the system may reach and the construction of a forward path from each erroneous state. We propose a method of error recovery which does not rely upon the enumeration of erroneous states or the context in which errors occur. When an error is encountered, we advocate mechanized recovery to an error-free state from which an operation may be retried. Using a form of backward error recovery, we are able to manage the complexity of error recovery in redundant disk arrays without sacrificing performance.

Keywords: parallel I/O, disk array, RAID, redundancy, reliability, recovery, pario-bib

Comment: Also available in HTML format at http://www.cs.cmu.edu/Web/Groups/PDL/HTML-Papers/CMG94/c.fm.html.

courtright:backward-tr:
William V. Courtright II and Garth A. Gibson. Backward error recovery in redundant disk arrays. Technical Report CMU-CS-94-193, Carnegie Mellon University, September 1994.
See also later version courtright:backward.

Keywords: parallel I/O, disk array, RAID, redundancy, reliability, recovery, pario-bib

courtright:raidframe:
William V. Courtright II, Garth A. Gibson, Mark Holland, and Jim Zelenka. RAIDframe: rapid prototyping for disk arrays. In Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 268-269, Philadelphia, PA, May 1996. ACM Press. Poster paper.
See also earlier version gibson:raidframe-tr.

Keywords: parallel I/O, RAID, disk array, reliability, simulation, pario-bib

Comment: See expanded version gibson:raidframe-tr.

coyne:hpss:
Robert A. Coyne, Harry Hulen, and Richard Watson. The high performance storage system. In Proceedings of Supercomputing '93, pages 83-92, Portland, OR, 1993. IEEE Computer Society Press.

Keywords: parallel I/O, file system, network, pario-bib

Comment: See also coyne:storage.

coyne:storage:
Robert A. Coyne, Harry Hulen, and Richard Watson. Storage systems for national information assets, 1993. Publication status unknown.

Keywords: parallel I/O, file system, network, pario-bib

Comment: See also coyne:hpss. They describe the National Storage Laboratory at LLNL. Collaboration with many companies. The idea is to build a combined storage system from many disk and tape components that is networked to supercomputers. The philosophy is to separate control and data network traffic, so that the overall control can be managed by a (relatively) small computer, without the same computer needing to pump all of the data through its CPU. The data would go directly from the devices to the client supercomputer. They also want to support multiple hierarchies of data storage, so that new technologies can be inserted without disrupting existing hierarchies. The access interface is layered so that high-level abstractions can be provided as well as low-level control for those who need it.

cozette:read2:
Olivier Cozette, Cyril Randriamaro, and Gil Utard. READ^2: Put disks at network level. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 698-704, Tokyo, May 2003. IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: Grand-challenge applications have to process large amounts of data, and thus require high-performance I/O systems. Cluster computing is a good alternative to proprietary systems for building cost-effective I/O-intensive platforms: some cluster architectures have won sorting benchmarks (MinuteSort, Datamation)! Recent advances in I/O component technologies (disk, controller, and network) let us expect higher I/O performance for data-intensive applications on clusters. The counterpart of this evolution is that much stress is put on the different buses (memory, I/O) of each node, which cannot be scaled. In this paper we investigate a strategy called READ^2 (Remote Efficient Access to Distant Device) to reduce this stress. With READ^2, any cluster node accesses remote disks directly: the remote processor and the remote memory are removed from the control and data paths, so inputs/outputs don't interfere with host processor and host memory activity. With the READ^2 strategy, a cluster can be considered a shared-disk architecture instead of a shared-nothing one. This paper describes an implementation of READ^2 on Myrinet networks. First experimental results show I/O performance improvement.

Keywords: parallel I/O, pario-bib

crandall:iochar:
Phyllis E. Crandall, Ruth A. Aydt, Andrew A. Chien, and Daniel A. Reed. Input/output characteristics of scalable parallel applications. In Proceedings of Supercomputing '95, San Diego, CA, December 1995. IEEE Computer Society Press.

Abstract: Rapid increases in computing and communication performance are exacerbating the long-standing problem of performance-limited input/output. Indeed, for many otherwise scalable parallel applications, input/output is emerging as a major performance bottleneck. The design of scalable input/output systems depends critically on the input/output requirements and access patterns for this emerging class of large-scale parallel applications. However, hard data on the behavior of such applications is only now becoming available. In this paper, we describe the input/output requirements of three scalable parallel applications (electron scattering, terrain rendering, and quantum chemistry) on the Intel Paragon XP/S. As part of an ongoing parallel input/output characterization effort, we used instrumented versions of the application codes to capture and analyze input/output volume, request size distributions, and temporal request structure. Because complete traces of individual application input/output requests were captured, in-depth, off-line analyses were possible. In addition, we conducted informal interviews of the application developers to understand the relation between the codes' current and desired input/output structure. The results of our studies show a wide variety of temporal and spatial access patterns, including highly read-intensive and write-intensive phases, extremely large and extremely small request sizes, and both sequential and highly irregular access patterns. We conclude with a discussion of the broad spectrum of access patterns and their profound implications for parallel file caching and prefetching schemes.

Keywords: file access pattern, file system workload, workload characterization, parallel I/O, pario-bib

Comment: They use the Pablo instrumentation and analysis tools to instrument three scalable applications that use heavy I/O: electron scattering, terrain rendering, and quantum chemistry. They look at the volume of data moved, the timing of I/O, and the periodic nature of I/O. They do a little bit with the access patterns of data within each file. They found a HUGE variation in request sizes, amount of I/O, number of files, and so forth. Their primary conclusion is thus that file systems should be adaptable to different access patterns, preferably under control of the application. Note proceedings only available on CD-ROM or WWW.

crauser:segment:
A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. A. Ramos. I/O-optimal computation of segment intersections. In Abello and Vitter [abello:dimacs], pages 131-138.

Keywords: parallel I/O, out-of-core algorithm, computational geometry, data structure, pario-bib

Comment: One of the component papers of abello:dimacs; see also vitter:survey, arge:lower, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.

cray:pario2:
Cray Research. DS-41 disk subsystem, 1990. Sales literature number MCFS-4-0790.

Keywords: parallel I/O, disk architecture, disk array, pario-bib

Comment: Glossy from Cray describing their new disk subsystem: up to four controllers and up to four ``drives'', each of which actually has four spindles. Thus, a full subsystem has 16 disks. Each drive or controller sustains 9.6 MBytes/sec, for a total of 38.4 MBytes/sec. Each drive has 4.8 GBytes, for a total of 19.2 GBytes. Access time per drive is 2-46.6 msec, average 24 msec. They don't say how the 4 spindles within a drive are controlled or arranged.

crockett:manual:
Thomas W. Crockett. Specification of the operating system interface for parallel file organizations. Publication status unknown (ICASE technical report), 1988.

Keywords: parallel I/O, parallel file system, pario-bib

Comment: Man pages for his Flex version of file interface. See crockett:par-files.

crockett:par-files:
Thomas W. Crockett. File concepts for parallel I/O. In Proceedings of Supercomputing '89, pages 574-579, 1989.

Keywords: parallel I/O, file access pattern, parallel file system, pario-bib

Comment: Two views of a file: global (for sequential programs) and internal (for parallel programs). Standardized forms for these views, for long-lived files. Temp files have specialized forms. The access types are sequential, partitioned, interleaved, and self-scheduled, plus global random and partitioned random. He relates these to their best storage patterns. No mention of prefetching. Buffer cache only needed for direct (random) access. The application must specify the access pattern desired.

csa-io:
T. J. M. Now: Parallel storage to match parallel CPU power. Electronics, 61(12):112, December 1988.

Keywords: parallel I/O, disk array, pario-bib

cypher:jrequire:
Robert Cypher, Alex Ho, Smaragda Konstantinidou, and Paul Messina. A quantitative study of parallel scientific applications with explicit communication. Journal of Supercomputing, 10(1):5-24, March 1996.
See also earlier version cypher:require.

Keywords: workload characterization, scientific computing, parallel programming, message passing, pario-bib

Comment: Some mention of I/O.

cypher:require:
R. Cypher, A. Ho, S. Konstantinidou, and P. Messina. Architectural requirements of parallel scientific applications with explicit communication. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 2-13, 1993.
See also later version cypher:jrequire.

Keywords: workload characterization, scientific computing, parallel programming, message passing, pario-bib

Comment: Some mention of I/O, though only in a limited way. Average 1207B/MFlop. Some of the applications do I/O throughout their run (2400B/MFlop avg), while others only do I/O at the beginning or end (14B/MFlop avg). But I/O is bursty, so larger bandwidths are suggested. The applications are parallel programs running on Intel Delta, nCUBE/1, nCUBE/2, and are in C, FORTRAN, or both.

davis:rle:
G. Davis, L. Lau, R. Young, F. Duncalfe, and L. Brebber. Parallel run-length encoding (RLE) compression-reducing I/O in dynamic environmental simulations. The International Journal of High Performance Computing Applications, 12(4):396-410, Winter 1998. In a Special Issue on I/O in Parallel Applications.

Abstract: Dynamic simulations based on time-varying inputs are extremely I/O intensive. This is shown by industrial applications generating environmental projections based on seasonal-to-interannual climate forecasts which have a compute to data-access ratio of O(n) leading to significant performance degradation. Exploitation of compression techniques such as Run-Length-Encoding (RLE) significantly reduces the I/O bottleneck and storage requirements. Unfortunately, traditional RLE algorithms do not perform well in a parallel-vector platform such as the Cray architecture. This paper describes the design and implementation of a new RLE algorithm based on data chunking and packing that exploits the Cray gather-scatter vector hardware and multiple processors. This innovative approach reduces I/O and file storage requirements on average by an order of magnitude. Data intensive applications such as the integration of environmental and global climate models now become practical in a realistic time-frame.

Keywords: parallel I/O application, compression, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

dazevedo:edonio:
E. F. D'Azevedo and C. H. Romine. EDONIO: Extended distributed object network I/O library. Technical Report ORNL/TM-12934, Oak Ridge National Laboratory, 1995.

Keywords: parallel I/O, pario-bib

debenedictis:modular:
Erik P. DeBenedictis and Juan Miguel del Rosario. Modular scalable I/O. Journal of Parallel and Distributed Computing, 17(1-2):122-128, January and February 1993.

Keywords: parallel I/O, MIMD, pario-bib

Comment: Journalized version of debenedictis:pario, debenedictis:ncube, and delrosario:nCUBE.

debenedictis:ncube:
Erik DeBenedictis and Juan Miguel del Rosario. nCUBE parallel I/O software. In Proceedings of the Eleventh Annual IEEE International Phoenix Conference on Computers and Communications, pages 0117-0124, Scottsdale, AZ, April 1992. IEEE Computer Society Press.

Keywords: parallel file system, parallel I/O, pario-bib

Comment: Interesting paper. Describes their mechanism for mapping I/O so that the file system knows both the mapping of a data structure into memory and its layout on the disks, allowing it to do the permutation and send the right data to the right disk, and back again. Interesting Unix-compatible interface. Needs to be extended to handle complex formats.

debenedictis:pario:
Erik DeBenedictis and Peter Madams. nCUBE's parallel I/O with Unix capability. In Proceedings of the Sixth Annual Distributed-Memory Computer Conference, pages 270-277, 1991.

Keywords: parallel I/O, multiprocessor file system, file system interface, pario-bib

Comment: Looks like they give the byte-level mapping, then do normal reads and writes; the mapping routes the data to and from the correct place. But it does let you intermix computation with I/O. Elegant concept. Nice interface. Works best for cases where (1) the data layout is known in advance, (2) the data format is known, and (3) the mapping is regular enough for easy specification. I think that irregular or unknown mappings could still be done with a flat mapping.
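
To make the mapping idea concrete: at its core is a function from a byte offset in the file to a (disk, offset-on-disk) pair, which lets the system route each byte range between memory and the right disk in either direction. The round-robin striping map below is only an illustration; the block size and disk count are made up, and the nCUBE scheme supports far more general mappings.

    STRIPE_UNIT = 4096   # illustrative, not from the paper
    NUM_DISKS = 8        # illustrative, not from the paper

    def file_to_disk(offset):
        """Map a file byte offset to (disk, local offset) under striping."""
        stripe = offset // STRIPE_UNIT
        disk = stripe % NUM_DISKS
        local = (stripe // NUM_DISKS) * STRIPE_UNIT + offset % STRIPE_UNIT
        return disk, local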

debenedictis:scalable-unix:
Erik P. DeBenedictis and Stephen C. Johnson. Extending Unix for scalable computing. IEEE Computer, 26(11):43-53, November 1993.

Keywords: parallel I/O, Unix, pario-bib

Comment: A more polished version of his other papers with del Rosario. The mapping-based mechanism is released in nCUBE software 3.0. It does support shared file pointers for self-scheduled I/O, as well as support for variable-length records and asynchronous I/O (although the primary mechanism is for synchronous, i.e., SPMD, I/O). The basic idea of scalable pipes (between programs, devices, etc.) with mappings that determine routing to units seems sound.

debergalis:dafs:
Matt DeBergalis, Peter Corbett, Steve Kleiman, Arthur Lent, Dave Noveck, Tom Talpey, and Mark Wittle. The direct access file system. In Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies, San Francisco, CA, April 2003. USENIX Association.

Abstract: The Direct Access File System (DAFS) is a new, fast, and lightweight remote file system protocol. DAFS targets the data center by addressing the performance and functional needs of clusters of application servers. We call this the local file sharing environment. File access performance is improved by utilizing Direct Access Transports, such as InfiniBand, Remote Direct Data Placement, and the Virtual Interface Architecture. DAFS also enhances file sharing semantics compared to prior network file system protocols. Applications using DAFS through a user-space I/O library can bypass operating system overhead, further improving performance. We present performance measurements of an IP-based DAFS network, demonstrating the DAFS protocol's lower client CPU requirements over commodity Gigabit Ethernet. We also provide the first multiprocessor scaling results for a well-known application (GNU gzip) converted to use DAFS.

Keywords: direct access file system, dafs, remote dma, pario-bib

delrosario:ncube:
Juan Miguel del Rosario. High performance parallel I/O on the nCUBE 2. Transactions of the Institute of Electronics, Information and Communications Engineers, J75D-I(8):626-636, August 1992.

Keywords: parallel I/O, parallel file system, pario-bib

Comment: More detail on the mapping functions, and more flexible mapping functions (can be user specified, or some from a library). Striped disks, parallel pipes, graphics, and HIPPI supported.

delrosario:prospects:
Juan Miguel del Rosario and Alok Choudhary. High performance I/O for parallel computers: Problems and prospects. IEEE Computer, 27(3):59-68, March 1994.

Keywords: parallel I/O, survey, pario-bib

Comment: Nice summary of grand-challenge and other applications, and their I/O needs. Points out the need for quantitative studies of workloads. Comments on architectures, eg, the advent of per-node disk devices. OS problems include communication latency, data decomposition, interface, prefetching and caching, and checkpointing. Runtime system and compilers are important, particularly in reference to data-mapping and re-mapping (see delrosario:two-phase). Persistent object stores and networking are mentioned briefly.

delrosario:two-phase:
Juan Miguel del Rosario, Rajesh Bordawekar, and Alok Choudhary. Improved parallel I/O via a two-phase run-time access strategy. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 56-70, Newport Beach, CA, 1993. Also published in Computer Architecture News 21(5), December 1993, pages 31-38.
See also earlier version delrosario:two-phase-tr.

Keywords: parallel I/O, multiprocessor file system, pario-bib

delrosario:two-phase-tr:
Juan Miguel del Rosario, Rajesh Bordawekar, and Alok Choudhary. Improving parallel I/O performance using a two-phase access strategy. Technical Report SCCS-406, NPAC at Syracuse University, 1993.
See also later version delrosario:two-phase.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: They show performance measurements of various data distributions on an nCUBE and the Touchstone Delta, for reading a matrix from a column-major file striped across disks into some distribution across processors. Distributions that don't match the I/O distribution are really terrible, due to having more, smaller requests, and sometimes mismatching the stripe size (getting seg-like contention) or block size (reading partial blocks). They find it is better to read the file using the ``best'' distribution, then to reshuffle the data in memory. Big speedups.
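
A minimal sketch of the two-phase idea under stated assumptions: mpi4py as the message layer (the paper predates MPI), an illustrative file name and sizes, and a toy target distribution. Phase one reads the file in the distribution that matches its layout, so each process issues one large contiguous request; phase two redistributes the data in memory.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    RECORD = 8                    # bytes per element (illustrative)
    NCOLS = 1024                  # matrix columns (illustrative)
    COLS_PER_PROC = NCOLS // size

    # Phase 1: one large contiguous read per process, matching the
    # column-major file layout (the "best" distribution).
    with open("matrix.dat", "rb") as f:       # hypothetical file
        f.seek(rank * COLS_PER_PROC * NCOLS * RECORD)
        slab = f.read(COLS_PER_PROC * NCOLS * RECORD)

    # Phase 2: reshuffle in memory; each process keeps the piece of every
    # slab that belongs to it under the target distribution (a toy
    # byte-interleaved split here, standing in for the real mapping).
    pieces = [slab[i::size] for i in range(size)]
    mine = b"".join(comm.alltoall(pieces))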

delrosario:vipfs-tr:
Juan Miguel del Rosario, Michael Harry, and Alok Choudhary. The design of VIP-FS: A virtual, parallel file system for high performance parallel and distributed computing. Technical Report SCCS-628, NPAC, Syracuse, NY 13244, May 1994.

Keywords: parallel I/O, parallel file system, heterogeneous, pario-bib

Comment: They are planning a parallel file system that is layered on top of standard workstation file systems, to be used by parallel applications on heterogeneous workstation clusters. It is all in user-level libraries and, on a per-application basis, application programs can distribute their data among many files on many machines. They plan to use a mapped interface like that of debenedictis:modular, and to support efficient collective I/O in ways reminiscent of bennett:jovian and kotz:diskdir. Published as harry:vipfs.

demmel:eosdis:
James Demmel, Melody Y. Ivory, and Sharon L. Smith. Modeling and identifying bottlenecks in EOSDIS. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 300-308. IEEE Computer Society Press, October 1996.

Abstract: Many parallel application areas that exploit massive parallelism, such as climate modeling, require massive storage systems for the archival and retrieval of data sets. As such, advances in massively parallel computation must be coupled with advances in mass storage technology in order to satisfy I/O constraints of these applications. We demonstrate the effects of such I/O-computation disparity for a representative distributed information system, NASA's Earth Observing System Distributed Information System (EOSDIS). We use performance modeling to identify bottlenecks in EOSDIS for two representative user scenarios from climate change research.

Keywords: climate modeling, performance modeling, parallel I/O, pario-bib

dewitt:gamma:
David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, and M. Muralikrishna. GAMMA: A high performance dataflow database machine. Technical Report TR-635, Dept. of Computer Science, Univ. of Wisconsin-Madison, March 1986.
See also later version dewitt:gamma2.

Keywords: parallel I/O, database, GAMMA, pario-bib

Comment: Better to cite dewitt:gamma3. Multiprocessor (VAX) DBMS on a token ring with a disk at each processor. They thought this was better than separating disks from processors by a network, since then the network must handle all I/O rather than just what needs to move. They conjecture that shared memory might be the best interconnection network. Relations are horizontally partitioned in some way, and each processor reads its own set and operates on them there.

dewitt:gamma-dbm:
David J. DeWitt, Shahram Ghandeharizadeh, and Donovan Schneider. A performance analysis of the GAMMA database machine. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 350-360, Chicago, IL, June 1988. ACM Press.
See also later version dewitt:gamma3.

Keywords: parallel I/O, database, performance analysis, Teradata, GAMMA, pario-bib

Comment: Compared Gamma with Teradata. Various operations on big relations. See fairly good linear speedup in many cases. They vary only one variable at a time. Their bottleneck was at the memory-network interface.

dewitt:gamma2:
David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, and M. Muralikrishna. GAMMA - A high performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Data Bases, pages 228-237, 1986.
See also earlier version dewitt:gamma.
See also later version dewitt:gamma3.

Keywords: parallel I/O, database, GAMMA, pario-bib

Comment: Almost identical to dewitt:gamma, with some updates. See that for comments, but cite this one. See also dewitt:gamma3 for a more recent paper.

dewitt:gamma3:
David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsaio, and Rick Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, 2(1):44-62, March 1990.
See also earlier version dewitt:gamma2.

Keywords: parallel I/O, database, GAMMA, pario-bib

Comment: An updated version of dewitt:gamma2, with elements of dewitt:gamma-dbm. Really only need to cite this one. This is the same basic idea as dewitt:gamma2, but after they ported the system from the VAXen to an iPSC/2. Speedup results good. Question: how about comparing it to a single-processor, single-disk system with increasing disk bandwidth? That is, how much of their speedup comes from the increasing disk bandwidth, and how much from the actual use of parallelism?

dewitt:pardbs:
David DeWitt and Jim Gray. Parallel database systems: The future of high-performance database systems. Communications of the ACM, 35(6):85-98, June 1992.

Keywords: database, parallel computing, parallel I/O, pario-bib

Comment: They point out that the comments of boral:critique - that database machines were doomed - did not really come true. Their new thesis is that specialized hardware is not necessary and has not been successful, but that parallel database systems are clearly successful. In particular, they argue for shared-nothing layouts. They survey the state-of-the-art parallel DB systems. Earlier version in Computer Architecture News 12/90.

dewitt:parsort:
David J. DeWitt, Jeffrey F. Naughton, and Donovan A. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 280-291, December 1991.

Keywords: parallel I/O, parallel database, external sorting, pario-bib

Comment: Comparing exact and probabilistic splitting for external sorting on a database. Model and experimental results from the Gamma machine. Basically, the idea is to decide on a splitting vector, which defines $N$ buckets for an $N$-process program, and to have each process read its initial segment of the data and send each element to the appropriate bucket (other process). All elements received are written to disks as small sorted runs. Then each process mergesorts its runs. Probabilistic splitting uses only a sample of the elements to define the vector.
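
A serial sketch of the probabilistic-splitting step, with the message passing elided: sample each partition, sort the combined sample, and take evenly spaced elements as the splitting vector; then route every element to the bucket (process) that owns its key range. The sample rate and nonempty partitions are assumptions of this illustration, and the local sort stands in for the paper's sorted-run-and-merge step.

    import random
    from bisect import bisect_right

    def splitting_vector(partitions, sample_rate=0.01):
        """Pick len(partitions)-1 splitters from a random sample."""
        sample = sorted(x for part in partitions
                        for x in random.sample(part, max(1, int(len(part) * sample_rate))))
        n = len(partitions)
        return [sample[(i * len(sample)) // n] for i in range(1, n)]

    def route_and_sort(partitions, splitters):
        """Send each element to its bucket; each bucket sorts locally."""
        buckets = [[] for _ in range(len(splitters) + 1)]
        for part in partitions:
            for x in part:
                buckets[bisect_right(splitters, x)].append(x)
        return [sorted(b) for b in buckets]   # concatenation is globally sorted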

dibble:bridge:
Peter Dibble, Michael Scott, and Carla Ellis. Bridge: A high-performance file system for parallel processors. In Proceedings of the Eighth International Conference on Distributed Computer Systems, pages 154-161, June 1988.
See also earlier version ellis:interleaved.
See also later version dibble:thesis.

Keywords: Carla, Bridge, multiprocessor file system, Butterfly, parallel I/O, pario-bib

Comment: See also dibble:*

dibble:sort:
Peter C. Dibble and Michael L. Scott. External sorting on a parallel interleaved file system. University of Rochester 1989-90 Computer Science and Engineering Research Review, 1989.
See also later version dibble:sort2.

Keywords: parallel I/O, sorting, merging, parallel file reference pattern, pario-bib

Comment: Based on the Bridge file system (see dibble:bridge). A parallel external merge-sort tool: sort the file on each disk, then do a parallel merge. The merge is serialized by the token-passing mechanism, but the I/O time dominates. The key is to keep the disks constantly busy. Uses some read-ahead and write-behind to control fluctuations in disk request timing. An analytical model of the algorithm lends insight and matches the timings well. Locality is a big win in Bridge tools.

dibble:sort2:
Peter C. Dibble and Michael L. Scott. Beyond striping: The Bridge multiprocessor file system. Computer Architecture News, 19(5), September 1989.
See also earlier version dibble:sort.

Keywords: parallel I/O, external sorting, merging, parallel file reference pattern, pario-bib

Comment: Subset of dibble:sort. Extra comments to distinguish from striping and RAID work. Good point that those projects are addressing a different bottleneck, and that they can provide essentially unlimited bandwidth to a single processor. Bridge could use those as individual file systems, parallelizing the overall file system, avoiding the software bottleneck. Using a very-reliable RAID at each node in Bridge could safeguard Bridge against failure for reasonable periods, removing reliability from Bridge level.

dibble:thesis:
Peter C. Dibble. A Parallel Interleaved File System. PhD thesis, University of Rochester, March 1990.

Keywords: parallel I/O, external sorting, merging, parallel file system, pario-bib

Comment: Also TR 334. Mostly covered by other papers, but includes good introduction, discussion of reliability and maintenance issues, and implementation. Short mention of prefetching implied that simple OBL was counter-productive, but later tool-specific buffering with read-ahead was often important. The three interfaces to the PIFS server are interesting. A fourth compromise might help make tools easier to write.

dickens:evaluation:
Phillip M. Dickens and Rajeev Thakur. Evaluation of collective I/O implementations on parallel architectures. Journal of Parallel and Distributed Computing, 61(8):1052-1076, August 2001.

Abstract: In this paper, we evaluate the impact on performance of various implementation techniques for collective I/O operations, and we do so across four important parallel architectures. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized implementation does provide excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all four computational platforms. Next, we evaluate implementation techniques for thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than just executing the collective I/O routine entirely in the foreground. To improve performance, we explore an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this implementation technique can provide significant performance gains, offering up to a 50% improvement over implementations that do not attempt to overlap collective I/O and computation.

Keywords: parallel I/O, collective I/O, pario-bib, parallel architecture

dickens:javaio:
Phillip M. Dickens and Rajeev Thakur. An evaluation of Java's I/O capabilities for high-performance computing. In Proceedings of the ACM 2000 Java Grande Conference, pages 26-35. ACM Press, June 2000.

Abstract: Java is quickly becoming the preferred language for writing distributed applications because of its inherent support for programming on distributed platforms. In particular, Java provides compile-time and run-time security, automatic garbage collection, inherent support for multithreading, support for persistent objects and object migration, and portability. Given these significant advantages of Java, there is a growing interest in using Java for high-performance computing applications. To be successful in the high-performance computing domain, however, Java must have the capability to efficiently handle the significant I/O requirements commonly found in high-performance computing applications.

While there has been significant research in high-performance I/O using languages such as C, C++, and Fortran, there has been relatively little research into the I/O capabilities of Java. In this paper, we evaluate the I/O capabilities of Java for high-performance computing. We examine several approaches that attempt to provide high-performance I/O, many of which are not obvious at first glance, and investigate their performance in both parallel and multithreaded environments. We also provide suggestions for expanding the I/O capabilities of Java to better support the needs of high-performance computing applications.

Keywords: parallel I/O, Java, pario-bib

dickens:threads:
Phillip Dickens and Rajeev Thakur. Improving collective I/O performance using threads. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, pages 38-45, April 1999.

Abstract: Massively parallel computers are increasingly being used to solve large, I/O intensive applications in many different fields. For such applications, the I/O requirements quite often present a significant obstacle in the way of achieving good performance, and an important area of current research is the development of techniques by which these costs can be reduced. One such approach is collective I/O, where the processors cooperatively develop an I/O strategy that reduces the number, and increases the size, of I/O requests, making a much better use of the I/O subsystem. Collective I/O has been shown to significantly reduce the cost of performing I/O in many large, parallel applications, and for this reason serves as an important base upon which we can explore other mechanisms which can further reduce these costs. One promising approach is to use threads to perform the collective I/O in the background while the main thread continues with other computation in the foreground.

In this paper, we explore the issues associated with implementing collective I/O in the background using threads. The most natural approach is to simply spawn off an I/O thread to perform the collective I/O in the background while the main thread continues with other computation. However, our research demonstrates that this approach is frequently the worst implementation option, often performing much more poorly than just executing collective I/O completely in the foreground. To improve the performance of thread-based collective I/O, we developed an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this new technique can significantly improve the performance of thread-based collective I/O, providing up to an 80% improvement over sequential collective I/O (where there is no attempt to overlap computation with I/O). Also, we discuss one very important application of this research which is the implementation of the split-collective parallel I/O operations defined in MPI 2.0.

Keywords: parallel I/O, multithread programming, collective I/O, disk-directed I/O, two-phase I/O, pario-bib

Comment: They examine an implementation of collective I/O in MPI2 such that the collective I/O is done in the background, using a thread, while the computation continues. They found that the performance can be quite disappointing, because of the competition for the CPU between the computational thread and the background thread executing the redistribution phase of the I/O operation. They get better results by doing the redistribution in the foreground, making the computation wait, and then doing the I/O in the background thread while the computation continues. Results from four major parallel platforms, but only for write operations.
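
A minimal sketch of the split they found effective, under stated assumptions: redistribute() and compute() are hypothetical stand-ins, and the write path is an ordinary file rather than a parallel file system. The communication phase runs in the foreground (the computation waits), and only the file write runs in a background thread, overlapping with the resumed computation.

    import threading

    def collective_write(local_data, path, redistribute, compute):
        buf = redistribute(local_data)      # foreground: communication phase

        def do_io():
            with open(path, "wb") as f:     # background: I/O phase
                f.write(buf)

        t = threading.Thread(target=do_io)
        t.start()
        compute()                           # overlaps with the write
        t.join()                            # the split-collective "end" call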

diegert:backprop:
Carl Diegert. Out-of-core backpropagation. In International Joint Conference on Neural Networks, volume 2, pages 97-103, 1990.

Keywords: parallel I/O, neural network, pario-bib

Comment: An application that reads large files, sequentially, on CM2 with DataVault.

ding:oceanmodel:
Chris H. Q. Ding and Yun He. Data organization and I/O in a parallel ocean circulation model. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: scientific application, parallel I/O, ocean modeling, climate modeling, pario-bib

Comment: They describe the approaches taken to optimize an out-of-core parallel ocean model simulation on distributed-memory parallel machines. The original code used fixed-size memory windows to store the in-core portions of the dataset on the machine. The code used the same approach for machines that had enough memory to store the entire dataset in-core, except that rather than reading and writing to disk, the code copied to/from a ramdisk (very copy intensive). The new code added an option to allow the entire dataset to be kept in-core. Another place where the code could be optimized was in the writing of the dataset. For computational efficiency, the data was stored in memory as an array U(ix,iz,iy), but other applications needed the data stored on disk as U(ix,iy,iz). To optimize the I/O, the new code allocated additional processors to gather, reorganize, and write the data to disk (much like Salvo).
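
A single-process NumPy stand-in for the reorganization step (the real code gathers slabs onto dedicated I/O processors first); the array sizes and file name are illustrative, not values from the paper.

    import numpy as np

    nx, nz, ny = 64, 32, 48                        # illustrative dimensions
    U = np.zeros((nx, nz, ny), dtype=np.float64)   # in-memory order (ix,iz,iy)

    # Swap the iz and iy axes so the bytes land on disk in (ix,iy,iz) order.
    U_disk = np.ascontiguousarray(U.transpose(0, 2, 1))
    U_disk.tofile("ocean_field.dat")               # hypothetical file name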

drapeau:raid-ii:
Ann L. Drapeau, Ken W. Shirriff, John H. Hartman, Ethan L. Miller, Srinivasan Seshan, Randy H. Katz, Ken Lutz, David A. Patterson, Edward K. Lee, Peter H. Chen, and Garth A. Gibson. RAID-II: a high-bandwidth network file server. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 234-244, 1994.
See also earlier version chen:raid2.

Keywords: RAID, disk array, network file system, parallel I/O, pario-bib

Comment: See also chen:raid2. The only significant addition in this paper is a discussion of the performance of the RAID-II running an LFS file system.

drapeau:tape-stripe:
Ann L. Drapeau and Randy H. Katz. Striping in large tape libraries. In Proceedings of Supercomputing '93, pages 378-387, Portland, OR, 1993. IEEE Computer Society Press.

Keywords: parallel I/O, pario-bib

Comment: RAID-3 striping across drives in a tape robot, using 3 data plus one parity. Tape-switch time is very high, i.e., 4 minutes. Switching four tapes at the same time would get only a little overlap, because there is only one robot arm. Assume large request size. Striping is much faster when only one request is considered, but with many requests outstanding, response time suffers due to limited concurrency. More readers with the same stripe-group size alleviate the contention and allow concurrency. Faster readers are the most important thing to improve performance, more important than improving robot speed. As both speeds improve, the benefit of striping diminishes. Seems like this could be expressed in a simple equation...
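
One such back-of-envelope model (mine, not the paper's): let $t_s$ be the tape-switch time, $B$ the drive transfer rate, $S$ the request size, and $k$ the stripe width, and assume the single robot arm serializes the switches. Then the single-request latencies are

    \[
      T_1 = t_s + \frac{S}{B}, \qquad
      T_k \approx k\,t_s + \frac{S}{kB} .
    \]

Striping wins ($T_k < T_1$) exactly when $(k-1)\,t_s < \frac{S}{B}\cdot\frac{k-1}{k}$, i.e., when $S > kBt_s$; with $t_s$ around four minutes, only very large requests benefit, which matches the comment's observation.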

dunigan:hypercubes:
T. H. Dunigan. Performance of the Intel iPSC/860 and Ncube 6400 hypercubes. Parallel Computing, 17:1285-1302, 1991.

Keywords: intel, ncube, hypercube, multiprocessor architecture, performance, parallel I/O, pario-bib

Comment: An excellent paper presenting lots of detailed performance measurements on the iPSC/1, iPSC/2, iPSC/860, nCUBE 3200, and nCUBE 6400: arithmetic, FLOPS, communication, I/O. Tables of numbers provide details needed for simulation. iPSC/860 definitely is fastest, but way out of balance wrt communication vs. computation. Number of message hops is not so important in newer machines.

durand:coloring:
Dannie Durand, Ravi Jain, and David Tseytlin. Applying randomized edge coloring algorithms to distributed communication: An experimental study. In Proceedings of the Seventh Symposium on Parallel Algorithms and Architectures, pages 264-274, 1995.

Keywords: parallel I/O, scheduling, pario-bib

Comment: They note that the set of data transfers in a parallel I/O architecture can be expressed as a graph-coloring problem. Realistically, a centralized solution is not possible because the information is inherently distributed. So they develop some distributed algorithms and experimentally compare them to the centralized algorithm. They get within 5% of optimal and do better than earlier algorithms.
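
The formulation is easy to state in code. Below is a centralized greedy sketch of the bipartite edge-coloring view (for illustration only; the paper's contribution is precisely the distributed, randomized version of this): each transfer is an edge (client, server), and its color is the round in which it runs, chosen so that no endpoint participates in two transfers in the same round.

    def schedule(transfers):
        """transfers: list of (client, server) pairs -> round per transfer."""
        busy = {}                         # endpoint -> set of rounds in use
        rounds = []
        for c, s in transfers:
            used = busy.setdefault(("c", c), set()) | busy.setdefault(("s", s), set())
            r = next(i for i in range(len(transfers) + 1) if i not in used)
            rounds.append(r)              # first round free at both endpoints
            busy[("c", c)].add(r)
            busy[("s", s)].add(r)
        return rounds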

durand:edge-coloring:
Dannie Durand, Ravi Jain, and David Tseytlin. Parallel I/O scheduling using randomized, distributed edge coloring algorithms. Journal of Parallel and Distributed Computing, 63(6):611-618, June 2003.

Abstract: A growing imbalance in CPU (central processing unit) and I/O (input/output) speeds has led to a communications bottleneck in distributed architectures, especially for data-intensive applications such as multimedia information systems, databases, and grand-challenge problems. Our solution is to schedule parallel I/O operations explicitly. We present a class of decentralized scheduling algorithms that eliminate contention for I/O ports while maintaining an efficient use of bandwidth. These algorithms, based on edge coloring and matching of bipartite graphs, rely upon simple heuristics to obtain shorter schedules. We use simulation to evaluate the ability of our algorithms to obtain near-optimal solutions in a distributed context, and compare our work with that of other researchers. Our results show that our algorithms produce schedules within 5% of the optimal schedule, a substantial improvement over existing algorithms.

Keywords: randomized edge coloring, scheduling algorithms, bipartite graphs, parallel I/O, pario-bib

durand:scheduling:
Dannie Durand, Ravi Jain, and David Tseytlin. Distributed scheduling algorithms to improve the performance of parallel data transfers. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 85-104. Bellcore, April 1994. Also appeared in Computer Architecture News 22(4).
See also later version durand:scheduling-book.

Keywords: parallel I/O algorithms, pario-bib

Comment: They devise some decentralized algorithms to generate schedules for data transfers between a set of clients and a set of servers when the complete set of transfers is known in advance, and the clients and servers are fairly tightly synchronized. They concentrate on the limitation that clients and servers may each only participate in one transfer at any given moment; interconnect bandwidth is not an issue. Their simulations show that their algorithms come within 20% of optimal.

durand:scheduling-book:
Dannie Durand, Ravi Jain, and David Tseytlin. Improving the performance of parallel I/O using distributed scheduling algorithms. In Jain et al. [iopads-book], chapter 11, pages 245-269.
See also earlier version durand:scheduling.

Abstract: The cost of data transfers, and in particular of I/O operations, is a growing problem in parallel computing. This performance bottleneck is especially severe for data-intensive applications such as multimedia information systems, databases, and Grand Challenge problems. A promising approach to alleviating this bottleneck is to schedule parallel I/O operations explicitly.

Although centralized algorithms for batch scheduling of parallel I/O operations have previously been developed, they may not be appropriate for all applications and architectures. We develop a class of decentralized algorithms for scheduling parallel I/O operations, where the objective is to reduce the time required to complete a given set of transfers. These algorithms, based on edge-coloring and matching of bipartite graphs, rely upon simple heuristics to obtain shorter schedules. We present simulation results indicating that the best of our algorithms can produce schedules whose length (or makespan) is within 2-20% of the optimal schedule, a substantial improvement on previous decentralized algorithms. We discuss theoretical and experimental work in progress and possible extensions.

Keywords: parallel I/O, distributed scheduling algorithm, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

duzett:ncube3:
Bob Duzett and Ron Buck. An overview of the nCUBE 3 supercomputer. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 458-464, 1992.

Keywords: parallel computer architecture, MIMD, pario-bib

Comment: Basically the same architecture as the nCUBE/2, scaled up. Eight to 65K processors, each 50 MIPS and 100 DP MFLOPS, initially 50 MHz. RISC. 16 hypercube channels and 2 I/O channels per processor. CPU chip includes MMU, TLB, I- and D-cache, hypercube and I/O channels, and memory interface. The channels have DMA support built-in (5 usec startup overhead, worst-case end-to-end latency 10 usec), and can talk directly to the memory interface or to the cache. 64-bit virtual address space, with 48 bits implemented. Hardware support for distributed virtual memory. Separate 16-node hypercube is used for I/O processing, with up to 400 disks attached. Packaging includes multi-chip module with DRAMs stacked directly on the CPU chip, fluid-cooled, so that an entire node is one package, with the 18 network links as essentially its only external connections.

edelson:pario:
Daniel Edelson and Darrell D. E. Long. High speed disk I/O for parallel computers. Technical Report UCSC-CRL-90-02, Baskin Center for Computer Engineering and Information Science, January 1990.

Keywords: parallel I/O, disk caching, parallel file system, log-structured file system, Intel iPSC/2, pario-bib

Comment: Essentially a small literature survey. No new ideas here, but it is a reasonable overview of the situation. Mentions caching, striping, disk layout optimization, log-structured file systems, and Bridge and Intel CFS. Plugs their ``Swift'' architecture (see cabrera:pario).

el-ghazawi:mp1:
Tarek A. El-Ghazawi. I/O performance of the MasPar MP-1 testbed. Technical Report TR-94-111, CESDIS, NASA GSFC, Greenbelt, MD, 1994.

Keywords: parallel I/O, parallel architecture, performance evaluation, pario-bib

Comment: See el-ghazawi:mpio.

el-ghazawi:mpio:
Tarek A. El-Ghazawi. Characteristics of the MasPar parallel I/O system. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 265-272, 1995.

Keywords: parallel I/O, parallel architecture, performance evaluation, pario-bib

Comment: See el-ghazawi:mp1.

elford:ppfs-detail:
Chris Elford, Chris Kuszmaul, Jay Huber, and Tara Madhyastha. Portable parallel file system detailed design. Technical report, University of Illinois at Urbana-Champaign, November 1993.

Keywords: parallel file system, parallel I/O, pario-bib

Comment: See also elford:ppfs-tr, huber:ppfs.

elford:ppfs-tr:
Chris Elford, Jay Huber, Chris Kuszmaul, and Tara Madhyastha. PPFS high level design documentation. Technical report, University of Illinois at Urbana-Champaign, November 1993.

Keywords: parallel file system, parallel I/O, pario-bib

Comment: See also elford:ppfs-detail, huber:ppfs-scenarios, huber:ppfs.

elford:trends:
Chris L. Elford and Daniel A. Reed. Technology trends and disk array performance. Journal of Parallel and Distributed Computing, 46(2):136-147, November 1997.

Keywords: trends, disk technology, disk array, parallel I/O, pario-bib

ellis:interleaved:
Carla Ellis and P. Dibble. An interleaved file system for the Butterfly. Technical Report CS-1987-4, Dept. of Computer Science, Duke University, January 1987.
See also later version dibble:bridge.

Keywords: Carla, multiprocessor file system, Bridge, Butterfly, parallel I/O, pario-bib

ellis:prefetch:
Carla Schlatter Ellis and David Kotz. Prefetching in file systems for MIMD multiprocessors. In Proceedings of the 1989 International Conference on Parallel Processing, pages I:306-314, St. Charles, IL, August 1989. Pennsylvania State Univ. Press.
See also earlier version ellis:prefetchTR.
See also later version kotz:prefetch.

Abstract: The problem of providing file I/O to parallel programs has been largely neglected in the development of multiprocessor systems. There are two essential elements of any file system design intended for a highly parallel environment: parallel I/O and effective caching schemes. This paper concentrates on the second aspect of file system design and specifically, on the question of whether prefetching blocks of the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions.

Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that 1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, 2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O operation, and 3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study).

We explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance, and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in this environment.

Keywords: dfk, parallel file system, prefetching, disk caching, MIMD, parallel I/O, pario-bib

englert:nonstop:
Susanne Englert, Jim Gray, Terrye Kocher, and Praful Shah. A benchmark of NonStop SQL Release 2 demonstrating near-linear speedup and scaleup on large databases. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 245-246, May 1990.

Abstract: NonStop SQL is an implementation of ANSI/ISO SQL on Tandem Computer Systems. In its second release, NonStop SQL transparently and automatically implements parallelism within an SQL statement by exploiting Tandem's multiprocessor architecture. For basic queries on a uniform database, it achieves performance that is near-linear with respect to the number of processors and disks used. The authors describe benchmarks demonstrating these results and the technology used to achieve them.

Keywords: parallel database, parallel architecture, parallel I/O, pario-bib

Comment: They (briefly) describe the Tandem NonStop system, including their disk nodes (which contain CPU, memory, and disk) and their use. A query involves sending a request to all the disk nodes, who independently read the appropriate data from their local disk, filter out all the interesting records, and send only those interesting records to the originator for processing. This is an early example of smart (programmable) I/O nodes.

esser:paragon:
Rüdiger Esser and Renate Knecht. Intel Paragon XP/S - architecture and software environment. Technical Report KFA-ZAM-IB-9305, Central Institute for Applied Mathematics, Research Center Jülich, Germany, \verb+r.esser@kfa-juelich.de+, April 26, 1993.

Keywords: multiprocessor architecture, pario-bib

Comment: A nice summary of the Paragon architecture and OS. Some information that is not found in Intel's technical summary, and with much less marketing hype. But, it was written in April 1993 with a look to the future, so it may represent things that are not ready yet. Network interface allows user-mode msgs, DMA direct to user space if receive has been posted; else there is a new queue for every possible sending processor. They plan to expand the nodes to 4-processors and 64-128 MB. PFS stripes across RAIDs. Now SCSI-1 with 5 MB/s, later 10 MB/s SCSI-2, then 20 MB/s fast SCSI-2. See also intel:paragon.

falkenberg:server:
Charles Falkenberg, Paul Hagger, and Steve Kelley. A server of distributed disk pages using a configurable software bus. Technical Report CS-TR-3082, Dept. of Computer Science, University of Maryland, July 1993. Also cross-referenced as UMIACS-TR-93-47.

Abstract: As network latency drops below disk latency, access time to a remote disk will begin to approach local disk access time. The performance of I/O may then be improved by spreading disk pages across several remote disk servers and accessing disk pages in parallel. To research this we have prototyped a data page server called a Page File. This persistent data type provides a set of methods to access disk pages stored on a cluster of remote machines acting as disk servers. The goal is to improve the throughput of a database management system or other I/O-intensive application by accessing pages from remote disks and incurring disk latency in parallel. This report describes the conceptual foundation and the methods of access for our prototype.

Keywords: parallel I/O, network, virtual memory, parallel database, pario-bib

Comment: An early document on a system under development. It declusters pages of a file across many page servers, and provides an abstraction of a linearly ordered collection of pages. The intended use is by database systems. As it stands now, there is little here other than block declustering, and thus, nothing new to the I/O community. Perhaps later they will develop interesting new caching or prefetching strategies.

fallah-adl:data:
Hassan Fallah-Adl, Joseph JáJá, Shunlin Liang, Yoram J. Kaufman, and John Townshend. Efficient algorithms for atmospheric correction of remotely sensed data. In Proceedings of Supercomputing '95, San Diego, CA, 1995. IEEE Computer Society Press.

Abstract: Remotely sensed imagery has been used for developing and validating various studies regarding land cover dynamics. However, the large amounts of imagery collected by the satellites are largely contaminated by the effects of atmospheric particles. The objective of atmospheric correction is to retrieve the surface reflectance from remotely sensed imagery by removing the atmospheric effects. We introduce a number of computational techniques that lead to a substantial speedup of an atmospheric correction algorithm based on using look-up tables. Excluding I/O time, the previously known implementation processes one pixel at a time and requires about 2.63 seconds per pixel on a SPARC-10 machine, while our implementation is based on processing the whole image and takes about 4-20 microseconds per pixel on the same machine. We also develop a parallel version of our algorithm that is scalable in terms of both computation and I/O. Experimental results obtained show that a Thematic Mapper (TM) image (36 MB per band, 5 bands need to be corrected) can be handled in less than 4.3 minutes on a 32-node CM-5 machine, including I/O time.

Keywords: remote sensing, parallel I/O application, pario-bib

Comment: Note proceedings only on CD-ROM or WWW.

feitelson:bpario:
Dror G. Feitelson, Peter F. Corbett, Sandra Johnson Baylor, and Yarsun Hsu. Parallel I/O subsystems in massively parallel supercomputers. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 25, pages 389-407. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version feitelson:pario.

Keywords: multiprocessor file system, parallel I/O, Vesta, pario-bib

Comment: Part of jin:io-book. Excellent survey. Reformatted version of feitelson:pario.

feitelson:pario:
Dror G. Feitelson, Peter F. Corbett, Sandra Johnson Baylor, and Yarsun Hsu. Parallel I/O subsystems in massively parallel supercomputers. IEEE Parallel and Distributed Technology, 3(3):33-47, Fall 1995.
See also earlier version feitelson:pario-tr.
See also later version feitelson:bpario.

Abstract: Applications executing on massively parallel supercomputers require a high aggregate bandwidth of I/O with low latency. This requirement cannot be satisfied by an external file server. One solution is to employ an internal parallel I/O subsystem, in which I/O nodes with DASD are linked to the same interconnection network that connects the compute nodes. The option of increasing the number of I/O nodes together with the number of compute nodes allows for a balanced architecture. Indeed, most multicomputer vendors provide internal parallel I/O subsystems as part of their product offerings. However, these systems typically attempt to preserve a Unix-compatible interface, hiding or abstracting the parallelism. New interfaces may be required to fully utilize the capabilities of parallel I/O.

Keywords: multiprocessor file system, parallel I/O, Vesta, pario-bib

Comment: A very nice survey of multiprocessor file systems issues. Published version of feitelson:pario-tr.

feitelson:pario-tr:
Dror G. Feitelson, Peter F. Corbett, Sandra Johnson Baylor, and Yarsun Hsu. Satisfying the I/O requirements of massively parallel supercomputers. Technical Report Research Report RC 19008 (83016), IBM T. J. Watson Research Center, July 1993.
See also later version feitelson:pario.

Keywords: multiprocessor file system, parallel I/O, Vesta, pario-bib, OS94W

Comment: A very nice survey of multiprocessor file systems issues. They make a good point that I/O needs would increase if I/O capabilities increase, because people would output more iterations, more complete data sets, etc. They make the case for internal file systems, the use of dedicated I/O nodes, the attachment of every RAID to two I/O nodes for reliability, the Vesta interface, and user control over the view of a parallel file. See also corbett:vesta*. Published as feitelson:pario.

feitelson:terminal:
Dror G. Feitelson. Terminal I/O for massively parallel systems. In Proceedings of the Scalable High-Performance Computing Conference, pages 263-270, 1994.

Keywords: parallel I/O, pario-bib

Comment: How to deal with stdin/stdout on a parallel processor. Basically, each task is given its own window, where the user can see the output and type input to that task. Then, they have a window for LEDs, i.e., little squares, one for each task. The square changes color depending on the situation. The default is to turn green when output is available, red when waiting for input, and white when the window is currently open. Clicking on these opens the appropriate window, so there is some control over which windows you are watching. They also provide a programmer interface to allow the programmer to control the LED color.
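
The LED behavior lends itself to a simple state-to-color mapping. The sketch below is our own illustration of that default policy (all names invented here), not code from the paper:

    # Hypothetical sketch of the per-task LED coloring described above.
    def led_color(window_open: bool, awaiting_input: bool,
                  has_output: bool) -> str:
        """Return the LED color for one task's terminal-I/O state."""
        if window_open:
            return "white"   # this task's window is currently open
        if awaiting_input:
            return "red"     # task is blocked waiting for input
        if has_output:
            return "green"   # unread output is available
        return "gray"        # idle color is an assumption; paper leaves it unspecified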

feitelson:vesta-perf:
Dror G. Feitelson, Peter F. Corbett, and Jean-Pierre Prost. Performance of the Vesta parallel file system. In Proceedings of the Ninth International Parallel Processing Symposium, pages 150-158, April 1995.
See also earlier version feitelson:vesta-perf-tr.

Keywords: parallel I/O, multiprocessor file system, Vesta, pario-bib

Comment: See feitelson:vesta-perf-tr.

feitelson:vesta-perf-tr:
Dror G. Feitelson, Peter F. Corbett, and Jean-Pierre Prost. Performance of the Vesta parallel file system. Technical Report RC 19760 (87534), IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, September 1994.
See also later version feitelson:vesta-perf.

Keywords: parallel I/O, multiprocessor file system, Vesta, pario-bib

Comment: Cite feitelson:vesta-perf. A good performance study of Vesta running on an SP-1. See corbett:jvesta for ultimate reference. In all, Vesta performed very well both for single-node and multiple-node performance. I wish that they had tried some very small BSUs; at one point they tried 16-byte BSUs and the performance looked very poor. Section on I/O vectors is confusing.

feitelson:xml:
Dror G. Feitelson and Tomer Klainer. XML, hyper-media, and Fortran I/O. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 43, pages 633-644. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: parallel I/O, parallel I/O interface, pario-bib

Comment: Part of jin:io-book.

feng:io-response:
Dan Feng, Hong Jiang, and Yifeng Zhu. I/O response time in a fault-tolerant parallel virtual file system. Lecture Notes in Computer Science, 3222:248-251, October 2004.

Abstract: A fault-tolerant parallel virtual file system is designed and implemented to provide high I/O performance and high reliability. A queuing model is used to analyze in detail the average response time when multiple clients access the system. The results show that I/O response time is a function of several operational parameters: it decreases with increases in the I/O buffer hit rate for read requests, the write buffer size for write requests, and the number of server nodes in the parallel file system, while a higher I/O request arrival rate increases the I/O response time.

Keywords: fault-tolerance, PVFS, performance analysis, pario-bib

feng:performance:
Dan Feng, Hong Jiang, and Yi-Feng Zhu. I/O performance of an RAID-10 style parallel file system. Journal of Computer Science and Technology, 19(6):965-972, November 2004.

Abstract: Without any additional cost, all the disks on the nodes of a cluster can be connected together through CEFT-PVFS, a RAID-10 style parallel file system, to provide multi-GB/s parallel I/O performance. I/O response time is one of the most important measures of quality of service for a client. When multiple clients submit data-intensive jobs at the same time, the response time experienced by the user is an indicator of the power of the cluster. In this paper, a queuing model is used to analyze in detail the average response time when multiple clients access CEFT-PVFS. The results reveal that response time is a function of several operational parameters: it decreases with increases in the I/O buffer hit rate for read requests, the write buffer size for write requests, and the number of server nodes in the parallel file system, while the higher the I/O request arrival rate, the longer the I/O response time. On the other hand, the collective power of a large cluster supported by CEFT-PVFS is shown to be able to sustain a steady and stable I/O response time for a relatively large range of the request arrival rate.

Keywords: PVFS, parallel I/O, I/O response time, pario-bib
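
For intuition only, the qualitative trends in both feng abstracts can be reproduced with a generic single-queue approximation (our illustration, not the authors' actual queuing model). With buffer hit rate $h$, request arrival rate $\lambda$, mean disk service time $s$, memory service time $t_{mem}$, and $n$ server nodes,

    T \approx h\,t_{mem} + (1-h)\,\frac{s}{1-\rho},
    \qquad \rho = \frac{\lambda\,(1-h)\,s}{n},

so the response time $T$ falls as $h$ or $n$ increases and rises with $\lambda$, matching the reported behavior.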

ferragina:soda96:
Paolo Ferragina and Roberto Grossi. Fast string searching in secondary storage: Theoretical developments and experimental results. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA '96), pages 373-382, Atlanta, June 1996. ACM Press.

Abstract: In a previous work [Ferragina-Grossi, ACM STOC 95], we proposed a text indexing data structure for secondary storage, which we called SB-tree, that combines the best of B-trees and suffix arrays, overcoming the limitations of inverted files, suffix arrays, suffix trees, and prefix B-trees. In this paper we study the performance of SB-trees in a practical setting, performing a set of searching and updating experiments. Improved performance was obtained by a new space efficient and alphabet-independent organization of the internal nodes of the SB-tree, and a new batch insertion procedure that avoids thrashing.

Keywords: out-of-core algorithm, parallel I/O, pario-bib

ferragina:stoc95:
Paolo Ferragina and Roberto Grossi. A fully-dynamic data structure for external substring search. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pages 693-702, Las Vegas, 1995. ACM Press.

Keywords: out-of-core algorithm, parallel I/O, pario-bib

ferreira:data-intensive:
Renato Ferreira, Gagan Agrawal, and Joel Saltz. Data parallel language and compiler support for data intensive applications. Parallel Computing, 28(5):725-748, May 2002.

Abstract: Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. High-level language and compiler support for developing applications that analyze and process such datasets has, however, been lacking so far.

In this paper, we present a set of language extensions and a prototype compiler for supporting high-level object-oriented programming of data intensive reduction operations over multidimensional data. We have chosen a dialect of Java with data-parallel extensions for specifying a collection of objects, a parallel for loop, and reduction variables as our source high-level language. Our compiler analyzes parallel loops and optimizes the processing of datasets through the use of an existing run-time system, called active data repository (ADR). We show how loop fission followed by interprocedural static program slicing can be used by the compiler to extract required information for the run-time system. We present the design of a compiler/run-time interface which allows the compiler to effectively utilize the existing run-time system.

A prototype compiler incorporating these techniques has been developed using the Titanium front-end from Berkeley. We have evaluated this compiler by comparing the performance of compiler-generated code with hand-customized ADR code for three templates, from the areas of digital microscopy and scientific simulations. Our experimental results show that the performance of the compiler-generated versions is, on average, 21% lower than, and in all cases within a factor of two of, the performance of the hand-coded versions.

Keywords: parallel I/O, parallel applications, data parallel, pario-bib

ferreira:microscope:
Renato Ferreira, Bongki Moon, Jim Humphries, Alan Sussman, Joel Saltz, Robert Miller, and Angelo Demarzo. The virtual microscope. In American Medical Informatics Association, 1997 Annual Fall Symposium, pages 449-453, Nashville, TN, October 1997.

Keywords: pario-bib, application

Comment: Best Application Paper award.

This paper describes a client/server application that emulates a high-power light microscope. They use wavelet compression to reduce the size of each of the electronic slides, and they use a parallel data server much like the ones used for satellite image data (see chang:titan) to service data requests.

fineberg:nht1:
Samuel A. Fineberg. Implementing the NHT-1 application I/O benchmark. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 37-55, Newport Beach, CA, 1993. Also published in Computer Architecture News 21(5), December 1993, pages 23-30.

Keywords: parallel I/O, multiprocessor file system, benchmark, pario-bib

Comment: See also carter:benchmark. Some preliminary results from one of their benchmarks. Note: ``I was only using a single Cray disk with a maximum transfer rate of 9.6MBytes/sec.'' - Fineberg.

fineberg:pmpio:
Samuel A. Fineberg, Parkson Wong, Bill Nitzberg, and Chris Kuszmaul. PMPIO - a portable implementation of MPI-IO. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 188-195. IEEE Computer Society Press, October 1996.

Abstract: MPI-IO provides a demonstrably efficient portable parallel Input/Output interface, compatible with the MPI standard. PMPIO is a "reference implementation" of MPI-IO, developed at NASA Ames Research Center. To date, PMPIO has been ported to the IBM SP-2, SGI and Sun shared memory workstations, the Intel Paragon, and the Cray J90. Preliminary results using the PMPIO implementation of MPI-IO show an improvement of as much as a factor of 20 on the NAS BTIO benchmark compared to a Fortran based implementation. We show comparative results on the SP-2, Paragon, and SGI architectures.

Keywords: parallel I/O, pario-bib

flynn:hyper-fs:
Robert J. Flynn and Haldun Hadimioglu. A distributed hypercube file system. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, pages 1375-1381, Pasadena, CA, 1988. ACM Press.

Keywords: parallel I/O, hypercube, parallel file system, pario-bib

Comment: For hypercube-like architectures. Interleaved files, though flexible. Separate network for I/O, maybe not hypercube. I/O is blocked and buffered - no coherency or prefetching issues discussed. Buffered close to point of use. Parallel access is ok. Broadcast supported? I/O nodes distinguished from comp nodes. I/O hooked to front-end too. See hadimioglu:fs and hadimioglu:hyperfs.

ford:rail:
Daniel A. Ford, Robert J. T. Morris, and Alan E. Bell. Redundant arrays of independent libraries (RAIL): the StarFish tertiary storage system. Parallel Computing, 24(1):45-64, January 1998.

Abstract: Increased computer networking has sparked a resurgence of the `on-line' revolution of the 1970's, making ever larger amounts of data available on a world wide basis and placing greater demands on the performance and availability of tertiary storage systems. In this paper, we argue for a new approach to tertiary storage system architecture that is obtained by coupling multiple small and inexpensive `building block' libraries (or jukeboxes) together to create larger tertiary storage systems. We call the resulting system a RAIL and show that it has performance and availability characteristics superior to conventional tertiary storage systems, for almost the same dollar/megabyte cost. A RAIL system is the tertiary storage equivalent of a fixed magnetic disk RAID storage system, but with several additional features that enable the ideas of data striping and redundancy to function efficiently on dismountable media and robotic media mounting systems. We present the architecture of such a system called Starfish I and describe the implementation of a prototype. We also introduce the idea of creating a log-structured library array (LSLA) on top of a RAIL architecture (StarFish II) and show how it can have write performance equivalent to that of secondary storage, and improved read performance along with other advantages such as easier compression and the elimination of the 4*RAID/RAIL write penalty.

Keywords: parallel I/O, redundant data, striping, tertiary storage, pario-bib

Comment: Part of a special issue.

foster:arrays:
Ian Foster and Jarek Nieplocha. Disk resident arrays: An array-oriented I/O library for out-of-core computations. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 33, pages 488-498. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version nieplocha:arrays.

Keywords: parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of nieplocha:arrays.

foster:chemio:
Ian Foster and Jarek Nieplocha. ChemIO: High-performance I/O for computational chemistry applications. WWW http://www.mcs.anl.gov/chemio/, February 1996.
See also later version nieplocha:chemio.

Keywords: computational science, chemistry, parallel I/O, pario-bib

Comment: A library package for computational chemistry programs. It supports out-of-core arrays. See also nieplocha:chemio.

foster:climate:
Ian Foster, Mark Henderson, and Rick Stevens. Data systems for parallel climate models. Technical Report ANL/MCS-TM-169, Argonne National Laboratory, July 1991. Copies of slides from a workshop by this title, with these organizers.

Keywords: parallel I/O, parallel database, multiprocessor file system, climate model, grand challenge, tertiary storage, archival storage, RAID, tape robot, pario-bib

Comment: Includes the slides from many presenters covering climate modeling, data requirements for climate models, archival storage systems, multiprocessor file systems, and so forth. NCAR data storage growth rates (p. 54), 500 bytes per MFlop, or about 8 TB/year with Y/MP-8. Average file length 26.2 MB. Migration across both storage hierarchy and generations of media. LLNL researcher: typical 50-year, 3-dimensional model with 5-degree resolution will produce 75 GB of output. Attendee list included.

foster:remote-io:
Ian Foster, David Kohr, Jr., Rakesh Krishnaiyer, and Jace Mogill. Remote I/O: Fast access to distant storage. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 14-25, San Jose, CA, November 1997. ACM Press.

Abstract: As high-speed networks make it easier to use distributed resources, it becomes increasingly common that applications and their data are not colocated. Users have traditionally addressed this problem by manually staging data to and from remote computers. We argue instead for a remote I/O paradigm in which programs use familiar parallel I/O interfaces to access remote filesystems. In addition to simplifying remote execution, remote I/O can improve performance relative to staging by overlapping computation and data transfer or by reducing communication requirements. However, remote I/O also introduces new technical challenges in the areas of portability, performance, and integration with distributed computing systems. We propose techniques designed to address these challenges and describe a remote I/O library called RIO that we are developing to evaluate the effectiveness of these techniques. RIO addresses issues of portability by adopting the quasi-standard MPI-IO interface and by defining a RIO device and RIO server within the ADIO abstract I/O device architecture. It addresses performance issues by providing traditional I/O optimizations such as asynchronous operations and through implementation techniques such as buffering and message forwarding to offload communication overheads. Microbenchmarks and application experiments demonstrate that our techniques can improve turnaround time relative to staging.

Keywords: parallel I/O, distributed file system, pario-bib

Comment: They want to support users that have datasets at different locations in the Internet, but need to access the data at supercomputer parallel machines. Rather than staging data in and out, they want to provide remote access. Issues: naming, dynamic loads, heterogeneity, security, fault-tolerance. All traffic goes through a 'forwarder node' that funnels all the traffic into the network. They use URLs for pathnames (e.g., "x-rio://..."). They find that non-blocking ops are important, as is collective I/O. They think that buffering will be important. Limited experiments.

fox:cubes:
G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors, volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.

Keywords: hypercube, pario-bib

Comment: See fox:cubix for parallel I/O.

fox:cubix:
G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors, chapters 6 and 15. Volume 1 of [fox:cubes], 1988.

Keywords: parallel file system, hypercube, pario-bib

Comment: Parallel I/O control, called CUBIX. Interesting method. Depends a lot on ``loose synchronization'', which is sort of SIMD-like.

franke:filters:
Ernest Franke and Michael Magee. Reducing data distribution bottlenecks by employing data visualization filters. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 255-262, Redondo Beach, CA, August 1999. IEEE Computer Society Press.

Abstract: Between 1994 and 1997, researchers at Southwest Research Institute (SwRI) investigated methods for distributing parallel computation and data visualization under the support of an internally funded Research Initiative Program entitled the Advanced Visualization Technology Project (AVTP). A hierarchical data cache architecture was developed to provide a flexible interface between the modeling or simulation computational processes and data visualization programs. Compared to conventional post facto data visualization approaches, this data cache structure provides many advantages including simultaneous data access by multiple visualization clients, comparison of experimental and simulated data, and visual analysis of computer simulation as computation proceeds.

However, since the data cache was resident on a single workstation, this approach did not address the issue of scalability of methods for avoiding the data storage bottleneck by distributing the data across multiple networked workstations. Scalability through distributed database approaches is being investigated as part of the Applied Visualization using Advanced Network Technology Infrastructure (AVANTI) project.

This paper describes a methodology currently under development that is intended to avoid bottlenecks that typically arise as the result of data consumers (e.g. visualization applications) that must access and process large amounts of data that has been generated and resides on other hosts, and which must pass through a central data cache prior to being used by the data consumer. The methodology is based on a fundamental paradigm that the end result (visualization) rendered by a data consumer can, in many cases, be produced using a reduced data set that has been distilled or filtered from the original data set.

In the most basic case, the filtered data used as input to the data consumer may simply be a proper subset of massive data sets that have been distributed among hosts. For the general case, however, the filtered data may bear no resemblance to the original data since it is the result of processing the raw data set and distilling it to its visual "essence", i.e. the minimal data set that is absolutely required by the data consumer in order to perform the required rendering function. Data distribution bottlenecks for visualization applications are thereby reduced by avoiding the transfer of large amounts of raw data in favor of considerably distilled visual data.

There are, of course, computational costs associated with this approach since raw data must be processed into its visual essence, but these computational costs may be distributed among multiple processors. It should be realized, however, that, in general, these computational costs would exist anyway since, for the visualization to be performed, there must be a transformation between the raw data and the visualization primitives (e.g. line segments, polygon vertices, etc.) to be rendered. The main principle put forth by this paper is that if data distribution bottlenecks are to be minimized, the amount of raw data transferred should be reduced by employing data filtering processes that can be distributed among multiple hosts.

The complete paper demonstrates, both analytically and experimentally, that this approach becomes increasingly effective (scalable) as the computational expense associated with the data filtering transformation rises.

Keywords: distributed computing, filters, grid, input/output, parallel I/O, pario-bib, app-pario

Comment: The goal of their work is to improve the performance of data visualization applications in which the data generators (a disk or a running application) and the data consumers (the visualization stations) are remote from one another. They deal with network bottlenecks by using a distributed-redundant data cache to hold intermediate data between the data generator and the data consumer. They also reduce network traffic by applying data filters to the data at the distributed cache processors. The main argument is that since the data must be filtered before it is visualized, it makes more sense to perform the filtering at the data cache, so that the computation can be distributed and the amount of data that must be transferred to the data consumer is reduced.

freedman:spiffi:
Craig S. Freedman, Josef Burger, and David J. DeWitt. SPIFFI - a scalable parallel file system for the Intel Paragon. IEEE Transactions on Parallel and Distributed Systems, 7(11):1185-1200, November 1996.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: The complete paper on the SPIFFI parallel file system. It seems to be much like Intel CFS from the programmer's point of view, with a few new file modes, user-selectable striping granularity. Their Paragon, though a source of problems, had a disk on every node (though they do not take advantage of that in this work). They have a buffer pool on each I/O node, which does prefetching in a somewhat novel way.

freedman:video:
Craig S. Freedman and David J. DeWitt. The SPIFFI scalable video-on-demand system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 352-363. ACM Press, 1995.

Keywords: parallel file system, multimedia, video server, pario-bib

Comment: See also freedman:spiffi. They simulate their video-on-demand server. Their model is a cluster of workstation servers, connected by a network to video-display terminals. The terminals just have a circular buffer queue that they fill by making requests to the server, and drain by uncompressing MPEG and displaying video. The servers manage a buffer pool and a set of striped disks. All videos are striped across all disks. They use dual LRU lists in the server buffer pool: one for used blocks, and one for prefetched blocks (``love prefetching''). They use a ``real-time'' disk scheduling algorithm that prioritizes requests by their deadlines (or anticipated deadline in case of a prefetch). Their metric is maximum number of terminals that can be supported without glitches. They plan to implement their system on a workstation cluster.
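
A minimal sketch of the deadline-ordered disk scheduling described above (our own simplification with invented names; the dual-LRU buffer pool and prefetch policy are omitted):

    import heapq

    # Requests are served earliest-deadline-first; a prefetch would be
    # enqueued with its *anticipated* deadline, as in the comment above.
    class DeadlineDiskQueue:
        def __init__(self):
            self._heap = []   # (deadline, seq, block) tuples
            self._seq = 0     # tie-breaker so equal deadlines never compare blocks

        def submit(self, block: int, deadline: float) -> None:
            heapq.heappush(self._heap, (deadline, self._seq, block))
            self._seq += 1

        def next_request(self):
            """Pop the block with the earliest deadline, or None if idle."""
            return heapq.heappop(self._heap)[2] if self._heap else None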

freitag:visualization:
Lori A. Freitag and Raymond M. Loy. Adaptive, multiresolution visualization of large data sets using a distributed memory octree. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: interactive visualization, multi-resolution visualization, adaptive visualization, scientific application, parallel octrees, pario-bib

Comment: They describe a technique that combines hierarchical data reduction methods with parallel computing to allow "interactive exploration of large data sets while retaining full-resolution capabilities." They point out that visualization of large data sets requires either a post-processing step to reduce the size, or sophisticated rendering algorithms that work with the full resolution. Their method combines the two techniques.

french:balance:
James C. French. Characterizing the balance of parallel I/O systems. In Proceedings of the Sixth Annual Distributed-Memory Computer Conference, pages 724-727, 1991.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Proposes the min_SAR, max_SAR, and ratio phi as measures of aggregate file system bandwidth. Has to do with load balance issues; how well the file system balances between competing nodes in a heavy-use period.

french:ipsc2io:
James C. French, Terrence W. Pratt, and Mriganka Das. Performance measurement of a parallel input/output system for the Intel iPSC/2 hypercube. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 178-187, 1991.
See also earlier version french:ipsc2io-tr.
See also later version french:ipsc2io-jpdc.

Keywords: parallel I/O, Intel iPSC/2, pario-bib

french:ipsc2io-jpdc:
James C. French, Terrence W. Pratt, and Mriganka Das. Performance measurement of the Concurrent File System of the Intel iPSC/2 hypercube. Journal of Parallel and Distributed Computing, 17(1-2):115-121, January and February 1993.
See also earlier version french:ipsc2io.

Keywords: parallel I/O, Intel iPSC/2, pario-bib

french:ipsc2io-tr:
James C. French, Terrence W. Pratt, and Mriganka Das. Performance measurement of a parallel input/output system for the Intel iPSC/2 hypercube. Technical Report IPC-TR-91-002, Institute for Parallel Computation, University of Virginia, 1991.
See also later version french:ipsc2io.

Keywords: parallel I/O, Intel iPSC/2, disk caching, prefetching, pario-bib

Comment: Nice study of performance of existing CFS system on 32-node + 4 I/O-node iPSC/2. They show big improvements due to declustering, preallocation, caching, and prefetching. See also pratt:twofs.

galbreath:applio:
N. Galbreath, W. Gropp, and D. Levine. Applications-driven parallel I/O. In Proceedings of Supercomputing '93, pages 462-471, Portland, OR, 1993. IEEE Computer Society Press.
See also later version galbreath:bapplio.

Keywords: parallel I/O, pario-bib

Comment: They give a useful overview of the I/O requirements of many applications codes, in terms of input, output, scratch files, debugging, and checkpointing. They also describe their architecture-independent I/O interface that provides calls to read and write entire arrays, with some flexibility in the format and distribution of the array. Curious centralized control method. Limited performance evaluation. They're trying to keep the I/O media, file layout, and I/O architecture transparent to the user. Implementation decides which processors actually do read/write. Data formatted or unformatted; file sequential or parallel; can specify distributed arrays with ghost points. Runs on lots of platforms; will also be implementing on IBM SP-1 with disk per node, 128 nodes. Their package is freely available via ftp. Future: buffer-size experiments, unstructured data, use parallel file internally and then sequentialize on close.
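
To make the array-interface style concrete, here is a hypothetical sketch (invented names, not the library's actual calls) of one process writing its block of a distributed array while stripping ghost rows:

    # Each process owns a block of rows plus `nghost` ghost rows on each
    # edge; only the owned interior is written. The real library would
    # also seek to this block's offset within the global array file.
    def write_owned_rows(f, local_rows: list[bytes], nghost: int) -> None:
        for row in local_rows[nghost:len(local_rows) - nghost]:
            f.write(row)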

galbreath:bapplio:
Nicholas P. Galbreath, William D. Gropp, and David M. Levine. Applications-driven parallel I/O. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 36, pages 539-547. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version galbreath:applio.

Keywords: parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of galbreath:applio.

ganger:diskarray:
Gregory R. Ganger, Bruce L. Worthington, Robert Y. Hou, and Yale N. Patt. Disk arrays: High performance, high-reliability storage subsystems. IEEE Computer, 27(3):30-36, March 1994.

Keywords: disk array, RAID, parallel I/O, pario-bib, survey

ganger:load-balance:
Gregory R. Ganger, Bruce L. Worthington, Robert Y. Hou, and Yale N. Patt. Disk subsystem load balancing: Disk striping vs. conventional data placement. In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume I, pages 40-49, 1993.

Keywords: parallel I/O, disk striping, load balancing, pario-bib

Comment: Using trace-driven simulation to compare dynamic load-balancing techniques in databases that span several disk drives, with the inherent load-balancing of striping. Their traces were from two Oracle databases on two different NCR systems. They found that striping, with its essentially random block-by-block load balancing, does a better job of avoiding short-term load imbalances than the ``manual'' load-balancing does.
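
For reference, the block-interleaved mapping whose statistical load balance they study looks roughly like this (a generic sketch, not the paper's exact configuration):

    # Logical block b of a striped store maps to disk (b mod N) at
    # per-disk offset (b div N), spreading requests across all disks.
    def stripe(block: int, num_disks: int) -> tuple[int, int]:
        return block % num_disks, block // num_disks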

garcia:expand-design:
Félix Garcia-Carballeira, Alejandro Calderon, Jesus Carretero, Javier Fernandez, and Jose M. Perez. The design of the Expand parallel file system. The International Journal of High Performance Computing Applications, 17(1):21-38, 2003.

Abstract: This article describes an implementation of MPI-IO using a new parallel file system, called Expand (Expandable Parallel File System), which is based on NFS servers. Expand combines multiple NFS servers to create a distributed partition where files are striped. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and NFS protocols. Using this system, we can join heterogeneous servers (Linux, Solaris, Windows 2000, etc.) to provide a parallel and distributed partition. The article describes the design, implementation and evaluation of Expand with MPI-IO. This evaluation has been made in Linux clusters and compares Expand and PVFS.

Keywords: parallel file system, parallel I/O, pario-bib

garcia:striping-reliability:
Hector Garcia-Molina and Kenneth Salem. The impact of disk striping on reliability. IEEE Database Engineering Bulletin, 11(1):26-39, March 1988.

Keywords: parallel I/O, disk striping, reliability, disk array, pario-bib

Comment: Reliability of striped filesystems may not be as bad as you think. Parity disks help. Performance improvements limited to small number of disks ($n<10$). Good point: efficiency of striping will increase as the gap between CPU/memory performance and disk speed and file size widens. Reliability may be better if measured in terms of performing a task in time T, since the striped version may take less time. This gives disks less opportunity to fail during that period. Also consider the CPU failure mode, and its use over less time.
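
That task-time argument can be made concrete with a back-of-the-envelope calculation (ours, not the paper's). If each disk fails at rate $\lambda$ and striping over $N$ disks shortens a task from time $T$ to $T/k$, then the probabilities of a failure during the task are approximately

    P_{serial} \approx \lambda T,
    \qquad
    P_{striped} \approx N \lambda \frac{T}{k},

so the striped run is no more likely to fail during the task whenever the speedup $k$ is at least $N$.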

garg:tflops-pfs:
Sharad Garg. TFLOPS PFS: Architecture and design of a highly efficient parallel file system. In Proceedings of SC98: High Performance Networking and Computing. ACM Press, November 1998.

Abstract: In recent years, many commercial Massively Parallel Processor (MPP) systems have been available to the computing community. These systems provide very high processing power (up to hundreds of GFLOPs), and can scale efficiently with the number of processors. However, many scientific and commercial applications that run on these multiprocessors may not experience significant benefit in terms of speedup and are bottlenecked by their I/O requirements. Although these multiprocessors may be configured with sufficient I/O hardware, the file system software often fails to provide the available I/O bandwidth to the application, and causes severe performance degradation for I/O intensive applications.

A highly efficient parallel file system has been implemented on Intel's Teraflops (TFLOPS) machine and provides a sustained I/O bandwidth of 1 GB/sec. This file system provides almost 95% of the available raw hardware I/O bandwidth, and the I/O bandwidth scales in proportion to the number of available I/O nodes.

Intel's TFLOPS machine is the first Accelerated Strategic Computing Initiative (ASCI) machine that DOE has acquired. This computer is 10 times more powerful than the fastest machine today, and will be used primarily to simulate nuclear testing and to ensure the safety and effectiveness of the nation's nuclear weapons stockpile.

This machine contains over 9000 Intel Pentium Pro processors, and will provide a peak CPU performance of 1.8 teraflops. This paper presents the I/O design and architecture of Intel's TFLOPS supercomputer, and describes the Cougar OS I/O and its interface with Intel's Parallel File System.

Keywords: parallel file system, intel, ASCI Red, pario-bib

Comment: Describes the parallel file system for ASCI Red. The paper is only available as HTML.

gava:parallel-ml:
Frédéric Gava. Parallel I/O in bulk-synchronous parallel ML. Lecture Notes in Computer Science, 3038:331-338, June 2004.

Abstract: Bulk Synchronous Parallel ML or BSML is a functional data-parallel language for programming bulk synchronous parallel (BSP) algorithms. The execution time can be estimated, and deadlocks and indeterminism are avoided. For large scale applications where parallel processing is helpful and where the total amount of data often exceeds the total main memory available, parallel disk I/O becomes a necessity. We present here a library of I/O features for BSML and its cost model.

Keywords: parallel I/O, parallel ML, BSML, data parallel language, pario-bib

gennart:bcomparing:
Benoit A. Gennart and Roger D. Hersch. Comparing multimedia storage architectures. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 37, pages 548-554. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version gennart:comparing.

Keywords: parallel I/O, multimedia, pario-bib

Comment: Part of jin:io-book; reformatted version of gennart:comparing.

gennart:comparing:
Benoit A. Gennart and Roger D. Hersch. Comparing multimedia storage architectures. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 323-328, 1995.
See also later version gennart:bcomparing.

Abstract: Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfil these requirements, we use a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through simulation the real-time behavior of two multiprocessor multi-disk architectures: GigaView and the Unix workstation cluster. GigaView incorporates point-to-point communication between processing units and the workstation cluster supports communication through a shared bus-and-memory architecture. For a standard multimedia server architecture consisting of 8 disks and 4 disk-node processors, we evaluate stream frame access times under various parameters such as load factors, frame size, stream throughput and synchronicity requirements. We compare the behavior of GigaView and the workstation cluster in terms of delay and delay jitter.

Keywords: parallel I/O, multimedia, pario-bib

gerner:sp2-io:
Jerry Gerner. Input/output on the IBM SP2 - an overview, 1995. Available at \verb+http://www.tc.cornell.edu/SmartNodes/Newsletters/IO.series/intro.html+.

Keywords: parallel I/O, IBM SP2, pario-bib

ghandeharizadeh:bmitra:
Shahram Ghandeharizadeh, Roger Zimmermann, Weifeng Shi, Reza Rejaie, Douglas J. Ierardi, and Ta-Wei Li. Mitra: A scalable continuous media server. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 41, pages 595-613. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version ghandeharizadeh:mitra.

Keywords: multimedia, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of ghandeharizadeh:mitra.

ghandeharizadeh:mitra:
Shahram Ghandeharizadeh, Roger Zimmermann, Weifeng Shi, Reza Rejaie, Doug Ierardi, and Ta-Wei Li. Mitra - a continuous media server. Multimedia Tools and Applications, 5(1):79-108, July 1998.
See also later version ghandeharizadeh:bmitra.

Abstract: Mitra is a scalable storage manager that supports the display of continuous media data types, e.g., audio and video clips. It is a software based system that employs off-the-shelf hardware components. Its present hardware platform is a cluster of multi-disk workstations, connected using an ATM switch. Mitra supports the display of a mix of media types. To reduce the cost of storage, it supports a hierarchical organization of storage devices and stages the frequently accessed objects on the magnetic disks. For the number of displays to scale as a function of additional disks, Mitra employs staggered striping. It implements three strategies to maximize the number of simultaneous displays supported by each disk. First, the EVEREST file system allows different files (corresponding to objects of different media types) to be retrieved at different block size granularities. Second, the FIXB algorithm recognizes the different zones of a disk and guarantees a continuous display while harnessing the average disk transfer rate. Third, Mitra implements the Grouped Sweeping Scheme (GSS) to minimize the impact of disk seeks on the available disk bandwidth.

In addition to reporting on implementation details of Mitra, we present performance results that demonstrate the scalability characteristics of the system. We compare the obtained results with theoretical expectations based on the bandwidth of participating disks. Mitra attains between 65% to 100% of the theoretical expectations.

Keywords: multimedia, parallel I/O, pario-bib

Comment: This paper describes the continuous media server Mitra. Mitra runs on a cluster of multi-disk HP 9000/735 workstations. Each workstation consists of 80 Mbytes of memory and four disks. They implement ``staggered striping'' of the data, in which disks are clustered based on media type and treated as a single logical unit. Data is then striped across the logical disk cluster in a round-robin fashion. They present performance results as a function of the total number of disks and the number of disks in a cluster.
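
A minimal sketch of staggered round-robin placement in this spirit (invented names and parameters; Mitra's actual layout is more elaborate):

    # Fragment i of an object lands on a cluster of `cluster_size` disks
    # whose first disk advances by `stride` per fragment, wrapping around
    # the array of `num_disks` disks.
    def fragment_disks(i: int, num_disks: int, cluster_size: int,
                       stride: int) -> list[int]:
        first = (i * stride) % num_disks
        return [(first + j) % num_disks for j in range(cluster_size)]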

ghandeharizadeh:servers:
Shahram Ghandeharizadeh and Richard Muntz. Design and implementation of scalable continuous media servers. Parallel Computing, 24(1):91-122, January 1998.

Keywords: parallel I/O, multimedia, pario-bib

Comment: Part of a special issue.

ghemawat:googlefs:
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 96-108, Bolton Landing, NY, October 2003. ACM Press.

Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to re-examine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Keywords: distributed file system, pario-bib

ghosh:hyper:
Joydeep Ghosh, Kelvin D. Goveas, and Jeffrey T. Draper. Performance evaluation of a parallel I/O subsystem for hypercube multiprocessors. Journal of Parallel and Distributed Computing, 17(1-2):90-106, January and February 1993.

Keywords: parallel I/O, MIMD, multiprocessor architecture, hypercube, pario-bib

Comment: Given a hypercube that has I/O nodes scattered throughout, they compare a plain one to one that has the I/O nodes also interconnected with a half-size hypercube. They show that this has better performance because the I/O traffic does not interfere with normal inter-PE traffic. See also ghosh:pario.

ghosh:pario:
Joydeep Ghosh and Bipul Agarwal. Parallel I/O subsystems for distributed-memory multiprocessors. In Proceedings of the Fifth International Parallel Processing Symposium, pages 381-384, 1991.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: They simulate a 128-node hypercube with 16 I/O nodes attached at uniformly distributed points. They compare two architectures: one with a separate I/O network, and another without a separate I/O network. When there, the extra network is used to route I/O packets from the originating I/O node to the I/O node closest to the destination processing node (or vice versa). They run simulations under workloads with differing amounts of locality, and experiment with different bandwidths for the links. They conclude that the extra network helps. But they never make the (proper, fair) comparison where the total network bandwidth is held constant. See also ghosh:hyper.

gibson:arrays:
Garth A. Gibson. Designing disk arrays for high data reliability. Journal of Parallel and Distributed Computing, 17(1-2):4-27, January/February 1993.

Keywords: parallel I/O, RAID, redundancy, reliability, pario-bib

gibson:book:
Garth A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. An ACM Distinguished Dissertation 1991. MIT Press, 1992.

Keywords: parallel I/O, disk array, disk striping, reliability, RAID, pario-bib

Comment: Excellent book. Good source for discussion of the access gap and transfer gap, disk lifetimes, parity methods, reliability analysis, and generally the case for RAIDs. On page 220 he briefly discusses multiprocessor I/O architecture.

gibson:bstorage:
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage architecture. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 28, pages 431-444. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version gibson:storage.

Keywords: network-attached storage, storage architecture, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of gibson:storage.

gibson:dram:
Garth A. Gibson, R. Hugo Patterson, and M. Satyanarayanan. Disk reads with DRAM latency. In Third Workshop on Workstation Operating Systems, pages 126-131, 1992.
See also later version patterson:informed.

Abstract: The most difficult and frequently most important challenge for high performance file access is the achievement of low latency cache misses. We propose to explore the utility and feasibility of using file access hints to schedule overlapped prefetching of file data. Hints may be issued explicitly by programmers, automatically by compilers, speculatively by parent tasks such as shells and makes, or historically by previously profiled executions. Our research will also address the thorny issues of hint specification, memory resource management, imprecise and incorrect hints, and appropriate interfaces for propagating hints through and to affected application, operating system, file system, and device-specific modules. We begin our research with a detailed examination of two applications with large potential for improvement: compilation of multiple module software systems and scientific simulation using very large grid state files.

Keywords: file system, prefetching, pario-bib

Comment: A relatively early TIP report with nothing really new over patterson:tip.

gibson:failcorrect:
Garth A. Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson. Failure correction techniques for large disk arrays. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 123-132, April 1989.
See also earlier version gibson:raid.
See also later version gibson:bfailcorrect.

Abstract: The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information in disk arrays. Codes are selected to maximize mean time to data loss or minimize disks containing redundant data, but are all constrained to minimize performance penalties associated with updating information or recovering from catastrophic disk failures. We also present codes that give highly reliable data storage with low redundant data overhead for arrays of 1000 information disks.

Keywords: parallel I/O, disk array, RAID, reliability, pario-bib

Comment: See gibson:raid for comments since it is the same.

gibson:nasd-scaling:
Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. File server scaling with network-attached secure disks. Performance Evaluation Review, 25(1):272-284, 1997.

Abstract: By providing direct data transfer between storage and client, network-attached storage devices have the potential to improve scalability for existing distributed file systems (by removing the server as a bottleneck) and bandwidth for new parallel and distributed file systems (through network striping and more efficient data paths). Together, these advantages influence a large enough fraction of the storage market to make commodity network-attached storage feasible. Realizing the technology's full potential requires careful consideration across a wide range of file system, networking and security issues. This paper contrasts two network-attached storage architectures: (1) Networked SCSI disks (NetSCSI) are network attached storage devices with minimal changes from the familiar SCSI interface, while (2) Network-Attached Secure Disks (NASD) are drives that support independent client access to drive object services. To estimate the potential performance benefits of these architectures, we develop an analytic model and perform trace-driven replay experiments based on AFS and NFS traces. Our results suggest that NetSCSI can reduce file server load during a burst of NFS or AFS activity by about 30%. With the NASD architecture, server load (during burst activity) can be reduced by a factor of up to five for AFS and up to ten for NFS.

Keywords: NASD, network-attached disks, distributed file system, parallel file system, security, secure disks, pario-bib

Comment: Essentially the conference (and subsequent journal) version of gibson:nasd-tr. The studies that use simple analytical models (based on measured workloads of NFS and AFS file managers) to compare the performance of NASD to SAD (server-attached disks) and NetSCSI are often cited as justification for the NASD and object-based storage approaches.

gibson:nasd-tr:
Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, and David Rochberg. A case for network-attached secure disks. Technical Report CMU-CS-96-142, Carnegie-Mellon University, June 1996.

Abstract: By providing direct data transfer between storage and client, network-attached storage devices have the potential to improve scalability (by removing the server as a bottleneck) and performance (through network striping and shorter data paths). Realizing the technology's full potential requires careful consideration across a wide range of file system, networking and security issues. To address these issues, this paper presents two new network-attached storage architectures. (1) Networked SCSI disks (NetSCSI) are network-attached storage devices with minimal changes from the familiar SCSI interface. (2) Network-attached secure disks (NASD) are drives that support independent client access to drive-provided object services. For both architectures, we present a sketch of repartitionings of distributed file system functionality, including a security framework whose strongest levels use tamper-resistant processing in the disks to provide action authorization and data privacy even when the drive is in a physically insecure location.

Using AFS and NFS, trace results suggest that NetSCSI can reduce file server load during a burst of AFS activity by a factor of about 2; for the NASD architecture, server load (during burst activity) can be reduced by a factor of about 4 for AFS and 10 for NFS.

Keywords: parallel I/O, network attached storage, distributed file systems, computer security, network attached secure disks, NASD, capability system, pario-bib

Comment: They outline their rationale for the idea of Network-attached Secure Disks (NASD). Basically the idea is to develop disk drives that attach right to the LAN, rather than to a file server, and allow clients to access the disks directly for many of the simpler file system actions (read and write file data, read file attributes), and only contact the server for more complex activities (opening and creating files, changing attributes). This removes the load from file servers, which are getting too slow to move large amounts of data needed by large installations. Issues include security, of course, which they solve with encryption (for privacy) and time-limited capabilities (keys) given out by the server to authenticated clients, which the clients show to the disk to gain access. They compare the performance of NASD, using a simple analytical model and parameters obtained from measuring real NFS and AFS implementations, to the performance of SAD (server-attached disks) and NetSCSI (a hybrid approach that involves the server in every operation but allows data to flow directly from disk to and from the network).
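
The time-limited capability scheme can be illustrated roughly as follows (our own simplification with an invented wire format; the NASD protocol itself is more involved):

    import hmac, hashlib, time

    SECRET = b"key shared by file manager and drive"  # assumed for the sketch

    # File manager: issue a capability for one object and a set of rights.
    def issue_capability(obj_id: int, rights: str, ttl: float = 60.0):
        expires = time.time() + ttl
        msg = f"{obj_id}:{rights}:{expires}".encode()
        return msg, hmac.new(SECRET, msg, hashlib.sha256).digest()

    # Drive: admit the request only if the tag verifies and is unexpired.
    def drive_check(msg: bytes, tag: bytes) -> bool:
        good = hmac.new(SECRET, msg, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, good):
            return False
        expires = float(msg.decode().rsplit(":", 1)[1])
        return time.time() < expires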

gibson:raid:
Garth Gibson, Lisa Hellerstein, Richard Karp, Randy Katz, and David Patterson. Coding techniques for handling failures in large disk arrays. Technical Report UCB/CSD 88/477, UC Berkeley, December 1988.
See also later version gibson:failcorrect.

Keywords: parallel I/O, RAID, reliability, disk array, pario-bib

Comment: Design of parity encodings to handle more than one bit failure in any group. Their 2-bit correcting codes are good enough for 1000-disk RAIDs that 3-bit correction is not needed.

gibson:raidframe-tr:
Garth A. Gibson, William V. Courtright II, Mark Holland, and Jim Zelenka. RAIDframe: Rapid prototyping for disk arrays. Technical Report CMU-CS-95-200, Carnegie Mellon University, October 1995.
See also later version courtright:raidframe.

Keywords: parallel I/O, RAID, disk array, reliability, simulation, pario-bib

Comment: Short version appeared as courtright:raidframe. Pretty neat idea. They provide a way to express the sequence of disk-access operations in a RAID controller using directed acyclic graphs, and a library that can `execute' these graphs either in a simulation or in a software-RAID implementation. The big benefit is that it is faster, easier, and less error-prone to implement various RAID management policies.
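
To make the DAG idea concrete, here is a minimal sketch (our illustration, not RAIDframe's actual API) of a RAID-5 small write expressed as a graph of operations and executed by walking dependencies; swapping the do_io callback is what would let the same graph drive either a simulator or a software-RAID driver.

    #include <cstdio>
    #include <vector>

    struct Node {
        const char* op;           // e.g. "read-old-data"
        std::vector<Node*> deps;  // operations that must complete first
        bool done;
    };

    // Depth-first execution: issue each operation once its deps are done.
    void fire(Node& n, void (*do_io)(const char*)) {
        for (Node* d : n.deps)
            if (!d->done) fire(*d, do_io);
        if (!n.done) { do_io(n.op); n.done = true; }
    }

    int main() {
        Node rod{"read-old-data", {}, false}, rop{"read-old-parity", {}, false};
        Node x{"xor", {&rod, &rop}, false};
        Node wnd{"write-new-data", {&x}, false}, wnp{"write-new-parity", {&x}, false};
        auto print_io = [](const char* op) { std::printf("issue %s\n", op); };
        fire(wnd, print_io);  // reads, xor, then write-new-data
        fire(wnp, print_io);  // only write-new-parity; shared work is already done
        return 0;
    }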

gibson:scotch-tr:
Garth A. Gibson, Daniel Stodolsky, Fay W. Chang, William V. Courtright II, Chris G. Demetriou, Eka Ginting, Mark Holland, Qingming Ma, LeAnn Neal, R. Hugo Patterson, Jiawen Su, Rachad Youssef, and Jim Zelenka. The Scotch parallel storage systems. Technical Report CMU-CS-95-107, Carnegie Mellon University, January 1995.
See also later version gibson:scotch1.

Keywords: parallel I/O, RAID, disk array, multiprocessor file system, file prefetching, file caching, cache consistency, pario-bib

gibson:scotch1:
Garth A. Gibson, Daniel Stodolsky, Fay W. Chang, William V. Courtright II, Chris G. Demetriou, Eka Ginting, Mark Holland, Qingming Ma, LeAnn Neal, R. Hugo Patterson, Jiawen Su, Rachad Youssef, and Jim Zelenka. The Scotch parallel storage systems. In Proceedings of 40th IEEE Computer Society International Conference (COMPCON 95), pages 403-410, San Francisco, Spring 1995.
See also earlier version gibson:scotch-tr.

Keywords: parallel I/O, RAID, disk array, multiprocessor file system, file prefetching, file caching, cache consistency, pario-bib

Comment: An overview of research being done in Garth's group. Touches on work in RAID disk arrays, parallel file systems, and prefetching. I think gibson:scotch-tr is nearly the same.

gibson:sdcr:
Garth A. Gibson, Jeffrey Scott Vitter, and John Wilkes. Strategic directions in storage I/O issues in large-scale computing. ACM Computing Surveys, 28(4):779-793, December 1996.

Abstract: We discuss the strategic directions and challenges in the management and use of storage systems: those components of computer systems responsible for the storage and retrieval of data. The performance gap between main and secondary memories shows no imminent sign of vanishing, and thus continuing research into storage I/O will be essential to reap the full benefit from the advances occurring in many other areas of computer science. In this report we identify a few strategic research goals and possible thrusts to meet those goals.

Keywords: supercomputing, data storage, database, parallel I/O, pario-bib

Comment: A more reliable, but limited-access, URL is http://www.acm.org/pubs/citations/journals/surveys/1996-28-4/p779-gibson/

gibson:storage:
Garth Gibson, David Nagle, Khalil Amiri, Jeff Butler, Fay Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective high-bandwidth storage architecture. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 92-104. ACM Press, 1998.
See also later version gibson:bstorage.

Keywords: network-attached storage, storage architecture, parallel I/O, pario-bib

golding:attribute:
Richard Golding, Elizabeth Shriver, Tim Sullivan, and John Wilkes. Attribute-managed storage. In Workshop on Modeling and Specification of I/O, at SPDP '95, 1995.

Abstract: Storage systems are continuing to grow, and they are becoming shared resources with the advent of I/O networks like FibreChannel. Managing these resources to meet performance and resiliency goals is becoming a significant challenge. We believe that completely automatic, attribute-managed storage is the way to address this issue. Our approach is based on declarative specifications of both application workloads and device characteristics. These are combined by a matching engine to generate a load assignment that provides optimal performance and meets availability guarantees, at minimum cost.

Keywords: I/O architecture, disk array, RAID, file system, storage system, pario-bib

Comment: This is just a 4-page position paper. See also shriver:slides.

golubchik:reducing:
Leana Golubchik, John C. S. Lui, and Richard Muntz. Reducing I/O demand in video-on-demand storage servers. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 25-36, May 1995.

Keywords: video server, multimedia, parallel I/O, pario-bib

Comment: An approach called adaptive piggybacking groups together streams that are watching the same video, but at slightly different times, so that they can share the I/O streams.

golubchik:striping:
Leana Golubchik, Richard R. Muntz, and Richard W. Watson. Analysis of striping techniques in robotic storage libraries. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 225-238. IEEE Computer Society Press, September 1995.

Abstract: In recent years advances in computational speed have been the main focus of research and development in high performance computing. In comparison, the improvement in I/O performance has been modest. Faster processing speeds have created a need for faster I/O as well as for the storage and retrieval of vast amounts of data. The technology needed to develop these mass storage systems exists today. Robotic storage libraries are vital components of such systems. However, they normally exhibit high latency and long transmission times. We analyze the performance of robotic storage libraries and study striping as a technique for improving response time. Although striping has been extensively studied in the context of disk arrays, the architectural differences between robotic storage libraries and arrays of disks suggest that a separate study of striping techniques in such libraries would be beneficial.

Keywords: mass storage, parallel I/O, pario-bib

golubchik:survey:
Leana Golubchik, John C.S. Lui, and Maria Papadopouli. A survey of approaches to fault tolerant design of VOD servers: Techniques, analysis and comparison. Parallel Computing, 24(1):123-155, January 1998.

Keywords: parallel I/O, multimedia, survey, pario-bib

Comment: Part of a special issue.

goodrich:external:
Michael T. Goodrich, Jyh-Jong Tsay, Darren E. Vengroff, and Jeffrey Scott Vitter. External-memory computational geometry. In Proceedings of the 34th Annual Symposium on Foundations of Computer Science, pages 714-723, November 1993.

Abstract: In this paper, we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory, and we use these techniques to develop optimal and practical algorithms for a number of important large-scale problems in computational geometry. Our algorithms are optimal for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal in terms of both I/O cost and internal computation.

Our results are built on four fundamental techniques: distribution sweeping, a generic method for externalizing plane-sweep algorithms; persistent B-trees, for which we have both on-line and off-line methods; batch filtering, a general method for performing $K$ simultaneous external-memory searches in any data structure that can be modeled as a planar layered dag; and external marriage-before-conquest, an external-memory analog of the well-known technique of Kirkpatrick and Seidel. Using these techniques we are able to solve a very large number of problems in computational geometry, including batched range queries, 2-d and 3-d convex hull construction, planar point location, range queries, finding all nearest neighbors for a set of planar points, rectangle intersection/union reporting, computing the visibility of segments from a point, performing ray-shooting queries in constructive solid geometry (CSG) models, as well as several geometric dominance problems.

These results are significant because large-scale problems involving geometric data are ubiquitous in spatial databases, geographic information systems (GIS), constraint logic programming, object oriented databases, statistics, virtual reality systems, and graphics. This work makes a big step, both theoretically and in practice, towards the effective management and manipulation of geometric data in external memory, which is an essential component of these applications.

Keywords: computational geometry, parallel I/O algorithm, pario-bib

gopinath:3tier:
K. Gopinath, Nitin Muppalaneni, N. Suresh Kumar, and Pankaj Risbood. A 3-tier RAID storage system with RAID1, RAID5, and compressed RAID5 for Linux. In Proceedings of the FREENIX Track at the 2000 USENIX Annual Technical Conference, pages 21-34. USENIX Association, 2000.

Keywords: parallel I/O, RAID, disk array, pario-bib

gotwals:pario:
Jacob Gotwals, Suresh Srinivas, and Shelby Yang. Parallel I/O from the user's perspective. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 129-137, 1995.

Keywords: parallel I/O, pario-bib

gotwals:streams:
Jacob Gotwals, Suresh Srinivas, and Dennis Gannon. pC++/streams: a library for I/O on complex distributed data structures. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 11-19, Santa Barbara, CA, July 1995. ACM Press.

Keywords: parallel I/O, object-oriented, distributed data structures, runtime library, pario-bib

Comment: The URL is for the tech-report version. They have a language called pC++ that allows object-parallel programming, and a library called d/streams for I/O of distributed arrays; pC++/streams is the combination. You open a file, specify the in-memory distribution, read from the stream, and then extract some variables. Likewise, you insert some variables (into the stream buffer), then write it. They manage the distribution, and they store the metadata necessary to reassemble the data structure when reading. Variables can be arbitrary classes, with $>>$ and $<<$ overloaded as the insert and extract operators. Performance is reasonable on the Intel Paragon and SGI Challenge.
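
The usage style they describe looks roughly like the following hypothetical sketch (invented names, not the actual pC++/streams API): whole distributed structures move through a stream with overloaded operators, while the library records enough metadata to reassemble them on a later read.

    #include <cstddef>
    #include <fstream>
    #include <vector>

    struct DistributedArray {        // stand-in for a pC++ collection
        std::vector<double> local;   // this processor's piece
    };

    struct DStream {
        std::ofstream out;
        explicit DStream(const char* path) : out(path, std::ios::binary) {}
    };

    // Insertion writes the local piece plus metadata (here just the count)
    // needed to reassemble the distributed structure when reading it back.
    DStream& operator<<(DStream& s, const DistributedArray& a) {
        std::size_t n = a.local.size();
        s.out.write(reinterpret_cast<const char*>(&n), sizeof n);
        s.out.write(reinterpret_cast<const char*>(a.local.data()),
                    n * sizeof(double));
        return s;
    }

    int main() {
        DistributedArray a{{1.0, 2.0, 3.0}};
        DStream s("matrix.dstream");   // hypothetical file name
        s << a;                        // insert the variable, then it is written
        return 0;
    }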

gray:infinite:
Jim Gray. What happens when processing, storage, and bandwidth are free and infinite? Keynote address at IOPADS '97, November 1997.

Abstract: Technology trends promise to give us processors with pico-second clock speeds. These pico-processors will spend much of their time waiting for information from the storage hierarchy. I believe this will force us to adopt a data-flow programming model. Similar trends will bless us with peta-byte online stores with exa-byte near-line stores. One large disk manufacturer claims it costs $8/year to manage a megabyte of online storage. That is 8 billion dollars per year to manage a petabyte. Automating storage management is one of our major challenges. This talk covers these technology trends, surveys the current status of commercial software tools (aka database systems), their peak performance and price performance. It then poses four major challenges: total cost of ownership, long-term archiving, reliably storing exabytes, and data mining on petabyte databases.

Keywords: parallel computing, computer architecture, parallel I/O, pario-bib, memory hierarchy, distributed computing, database, object oriented

Comment: Very interesting talk. URL points to PowerPoint slides.

gray:stripe:
Jim Gray, Bob Horst, and Mark Walker. Parity striping of disk arrays: Low-cost reliable storage with acceptable throughput. In Proceedings of the 16th VLDB Conference, pages 148-159, 1990.

Keywords: disk striping, reliability, pario-bib

Comment: Parity striping, a variation of RAID 5, is just a different way of mapping blocks to disks. It groups parity blocks into extents, and does not stripe the data blocks: a logical disk is mostly contained in one physical disk, plus a parity region on another disk. Good for transaction-processing workloads. It has the low cost/GByte and reliability of RAID without RAID's high transfer rate, but with much better requests/second throughput than RAID 5 (though about 40% worse than mirrors), so it is a compromise between RAID and mirrors. BUT, see mourad:raid.
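
One plausible reading of that layout, as a block-mapping sketch (the paper's exact layout may differ): data blocks are not striped, so logical disk d maps straight onto physical disk d, and only parity is distributed, in large per-extent regions that rotate around the array.

    #include <cstdint>

    constexpr int N = 8;                       // disks in the array
    constexpr uint64_t EXTENT = 1 << 16;       // blocks per parity extent
    constexpr uint64_t PARITY_BASE = 1 << 20;  // start of each disk's parity region

    struct Location { int disk; uint64_t block; };

    // Data: identity mapping, so sequential access stays on one spindle.
    Location data_block(int logical_disk, uint64_t b) {
        return {logical_disk, b};
    }

    // Parity for the stripe containing block b rotates across disks per extent.
    Location parity_block(uint64_t b) {
        uint64_t e = b / EXTENT;                        // parity extent index
        int pdisk = static_cast<int>(e % N);            // rotate around the array
        uint64_t slot = (e / N) * EXTENT + b % EXTENT;  // offset in parity region
        return {pdisk, PARITY_BASE + slot};
    }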

grimshaw:ELFSTR:
Andrew S. Grimshaw and Edmond C. Loyot, Jr. ELFS: object-oriented extensible file systems. Technical Report TR-91-14, Univ. of Virginia Computer Science Department, July 1991.
See also later version grimshaw:elfs.

Keywords: parallel I/O, parallel file system, object-oriented, file system interface, Intel iPSC/2, pario-bib

Comment: See also grimshaw:elfs. They hope to provide high bandwidth and low latency, reduce the cognitive burden on the programmer, and manage the proliferation of data formats and architectural changes. Details of the plan to make an extensible OO interface to the file system. Objects each have a separate thread of control, so they can do asynchronous activity like prefetching and caching in the background, and support multiple outstanding requests. The Mentat object system makes it easy for them to support pipelining of I/O with I/O and computation in the user program. Lets the user choose the type of consistency needed. See grimshaw:objects for more results.

grimshaw:elfs:
Andrew S. Grimshaw and Edmond C. Loyot, Jr. ELFS: object-oriented extensible file systems. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, page 177, December 1991.
See also earlier version grimshaw:ELFSTR.
See also later version grimshaw:objects.

Keywords: parallel I/O, parallel file system, object-oriented, file system interface, pario-bib

Comment: Full paper is grimshaw:ELFSTR. Really neat idea. Uses an OO interface to the file system, which is mostly in user mode. The object classes represent particular access patterns (e.g., a 2-D matrix) in the file, and hide the actual structure of the file. The object knows enough to tailor the cache and prefetch algorithms to the semantics. Class inheritance allows layering.
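
A hypothetical rendering of the idea (invented names, not the actual ELFS classes): each file class encodes an access pattern and tailors its own prefetching, and inheritance lets new formats layer on old ones.

    #include <cstddef>
    #include <cstdint>

    class ElfsFile {  // generic file object; subclasses add semantics
    public:
        virtual ~ElfsFile() = default;
        virtual void read(void* buf, std::size_t n, uint64_t off) = 0;
    protected:
        virtual void prefetch(uint64_t) {}  // default: no prefetching
    };

    class Matrix2DFile : public ElfsFile {  // knows the file is a 2-D matrix
    public:
        Matrix2DFile(std::size_t rows, std::size_t cols)
            : rows_(rows), cols_(cols) {}
        // Reading element (i,j): the class can prefetch along the row because
        // it understands the layout, semantic knowledge that a flat
        // byte-stream interface cannot exploit.
        void read_element(void* buf, std::size_t i, std::size_t j) {
            uint64_t off = (i * cols_ + j) * sizeof(double);
            prefetch(off + sizeof(double));  // e.g., the next element in the row
            read(buf, sizeof(double), off);
        }
    private:
        std::size_t rows_, cols_;
    };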

grimshaw:objects:
Andrew S. Grimshaw and Jeff Prem. High performance parallel file objects. In Proceedings of the Sixth Annual Distributed-Memory Computer Conference, pages 720-723, 1991.

Keywords: parallel I/O, multiprocessor file system, file system interface, pario-bib

Comment: Not much new from ELFS TR. A better citation than grimshaw:ELFS though. Does give CFS performance results. Note on 721 he says that CFS prefetches into ``local memory from which to satisfy future user requests that never come.'' This happens if the local access pattern isn't purely sequential, as in an interleaved pattern.

gropp:io-redundancy:
William D. Gropp, Robert Ross, and Neill Miller. Providing efficient I/O redundancy in MPI environments. Lecture Notes in Computer Science, 3241:77-86, November 2004.

Abstract: Highly parallel applications often use either highly parallel file systems or large numbers of independent disks. Either approach can provide the high data rates necessary for parallel applications. However, the failure of a single disk or server can render the data useless. Conventional techniques, such as those based on applying erasure correcting codes to each file write, are prohibitively expensive for massively parallel scientific applications because of the granularity of access at which the codes are applied. In this paper we demonstrate a scalable method for recovering from single disk failures that is optimized for typical scientific data sets. This approach exploits coarser-grained (but precise) semantics to reduce the overhead of constructing recovery data and makes use of parallel computation (proportional to the data size and independent of number of processors) to construct data. Experiments are presented showing the efficiency of this approach on a cluster with independent disks, and a technique is described for hiding the creation of redundant data within the MPI-IO implementation.

Keywords: fault-tolerance, single-disk failures, MPI-IO, pario-bib

gropp:mpi2:
William Gropp, Ewing Lusk, and Rajeev Thakur. Using MPI-2: Advanced Features of the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.

Abstract: The Message Passing Interface (MPI) specification is widely used for solving significant scientific and engineering problems on parallel computers. There exist more than a dozen implementations on computer platforms ranging from IBM SP-2 supercomputers to clusters of PCs running Windows NT or Linux ("Beowulf" machines). The initial MPI Standard document, MPI-1, was recently updated by the MPI Forum. The new version, MPI-2, contains both significant enhancements to the existing MPI core and new features.

Using MPI is a completely up-to-date version of the authors' 1994 introduction to the core functions of MPI. It adds material on the new C++ and Fortran 90 bindings for MPI throughout the book. It contains greater discussion of datatype extents, the most frequently misunderstood feature of MPI-1, as well as material on the new extensions to basic MPI functionality added by the MPI-2 Forum in the area of MPI datatypes and collective operations.

Using MPI-2 covers the new extensions to basic MPI. These include parallel I/O, remote memory access operations, and dynamic process management. The volume also includes material on tuning MPI applications for high performance on modern MPI implementations.

Keywords: parallel computing, message passing, parallel I/O, multiprocessor file system interface, pario-bib

Comment: Has a large chapter on MPI-IO with lots of example programs.
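
For flavor, a minimal MPI-IO program of the kind the chapter teaches (our own sketch, not an excerpt from the book): each process writes its block of an array to a shared file at an offset computed from its rank, using a collective write.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1024;                  // doubles per process
        std::vector<double> buf(N, rank);    // fill with this process's rank

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);
        MPI_File_write_at_all(fh, off, buf.data(), N, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }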

gross:io:
Thomas Gross and Peter Steenkiste. Architecture implications of high-speed I/O for distributed-memory computers. In Proceedings of the 8th ACM International Conference on Supercomputing, pages 176-185, Manchester, UK, July 1994. ACM Press.

Keywords: parallel I/O, parallel architecture, networking, pario-bib

Comment: They examine the characteristics of a system that has I/O nodes which interface between the internal interconnection network of a distributed-memory MIMD machine and some external network, such as HIPPI. They build a simple model to show how different components affect the I/O throughput. They show the performance of their iWarp-HIPPI interface. They conclude that the I/O nodes must have sufficient memory bandwidth to support multiple data streams coming from several compute nodes, being combined into a single faster external network, or vice versa. They need to support scatter/gather, because the data is often distributed in small pieces. For the same reason, they need to have low per-message overhead. The internal network routing must allow multiple paths between compute nodes and the I/O nodes, to avoid congestion.
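
The paper's exact model is not reproduced in this comment, but a bottleneck model of the kind described typically takes the form $T = \min(B_{ext}, \; k B_{int}, \; B_{mem}/c)$, where $B_{ext}$ is the external (e.g., HIPPI) link bandwidth, $k$ is the number of compute-node streams of internal-network bandwidth $B_{int}$ being combined, $B_{mem}$ is the I/O node's memory bandwidth, and $c$ counts how many times each byte crosses the memory bus. This is a hedged reading, not the paper's formula, but it makes the conclusion concrete: once the external link is fast, the $B_{mem}/c$ term usually binds, so I/O nodes need memory bandwidth well beyond workstation class.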

grossi:crosstrees:
Roberto Grossi and Giuseppe F. Italiano. Efficient cross-trees for external memory. In Abello and Vitter [abello:dimacs], pages 87-106.

Keywords: out-of-core algorithm, data structure, pario-bib

Comment: See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.

grossman:blibrary:
Robert Grossman, Xiao Qin, Wen Xu, Harry Hulen, and Terry Tyler. An architecture for a scalable high-performance digital library. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 39, pages 566-575. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version grossman:library.

Keywords: mass storage, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of grossman:library.

grossman:library:
R. Grossman, X. Qin, W. Xu, H. Hulen, and T. Tyler. An architecture for a scalable high-performance digital library. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 89-98. IEEE Computer Society Press, September 1995.
See also later version grossman:blibrary.

Abstract: Requirements for a high-performance, scalable digital library of multimedia data are presented together with a layered architecture for a system that addresses the requirements. The approach is to view digital data as persistent collections of complex objects and to use lightweight object management to manage this data. To scale as the amount of data increases, the object management component is layered over a storage management component. The storage management component supports hierarchical storage, third-party data transfer and parallel input-output. Several issues that arise from the interface between the storage management and object management components are discussed. The authors have developed a prototype of a digital library using this design. Two key components of the prototype are AIM Net and HPSS. AIM Net is a persistent object manager and is a product of Oak Park Research. HPSS is the High Performance Storage System, developed by a collaboration including IBM Government Systems and several national labs.

Keywords: mass storage, parallel I/O, pario-bib

gupta:generating:
Sandeep K. S. Gupta, Zhiyong Li, and John H. Reif. Generating efficient programs for two-level memories from tensor-products. In Proceedings of the Seventh IASTED/ISMM International Conference on Parallel and Distributed Computing and Systems, pages 510-513, Washington, D.C., October 1995.

Abstract: This paper presents a framework for synthesizing efficient out-of-core programs for block recursive algorithms such as the fast Fourier transform (FFT) and Batcher's bitonic sort. The block recursive algorithms considered in this paper are described using tensor (Kronecker) product and other matrix operations. The algebraic properties of the matrix representation are used to derive efficient out-of-core programs. These programs are targeted towards a two-level disk model which allows HPF-supported cyclic(B) data distribution on a disk array. The effectiveness of our approach is demonstrated through an example out-of-core FFT program implemented on a workstation.

Keywords: parallel I/O algorithm, pario-bib

hack:ncar:
James J. Hack, James M. Rosinski, David L. Williamson, Byron A. Boville, and John E. Truesdale. Computational design of the NCAR community climate model. Parallel Computing, 21:1545-1569, 1995.

Keywords: parallel computing, scientific computing, weather prediction, global climate model, parallel I/O, pario-bib

Comment: There is some discussion of I/O issues. This weather code does some out-of-core work, to communicate data between time steps. They also dump a 'history' file every simulated day, and periodic checkpoint files. They are flexible about the layout of the history file, assuming postprocessing will clean it up. The I/O is not too much trouble on the Cray C90, where they get 350 MBps to the SSD for the out-of-core data. The history I/O is no problem. On distributed-memory machines with no SSD, out-of-core was impractical and the history file was only written once per simulated month. 'The most significant weakness in the distributed-memory implementation is the treatment of I/O, [due to] file system maturity....' See hammond:atmosphere and jones:skyhi in the same issue.

hacker:effects:
Thomas J. Hacker, Brian Noble, and Brian D. Athey. The effects of systemic packet loss on aggregate TCP flows. In Proceedings of SC2002: High Performance Networking and Computing, Baltimore, MD, November 2002.

Keywords: network congestion, parallel tcp streams, transport protocols, pario-bib

hacker:fairness:
Thomas J. Hacker, Brian Noble, and Brian D. Athey. Improving throughput and maintaining fairness using parallel TCP. In The 23rd Conference of the IEEE Communications Society (INFOCOM), Hong Kong, March 2004. IEEE Computer Society Press.

Keywords: network congestion, parallel tcp streams, fairness, transport protocols, pario-bib

Comment: Also see earlier hacker:parallel-tcp and hacker:effects

hacker:parallel-tcp:
Thomas J. Hacker, Brian D. Athey, and Brian Noble. The end-to-end performance effects of parallel TCP sockets on a lossy wide-area network. In Proceedings of the International Parallel and Distributed Processing Symposium, pages 434-443, Fort Lauderdale, Florida, April 2002. IEEE Computer Society Press.

Abstract: This paper examines the effects of using parallel TCP flows to improve end-to-end network performance for distributed data intensive applications. A series of transmission experiments were conducted over a wide-area network to assess how parallel flows improve throughput, and to understand the number of flows necessary to improve throughput while avoiding congestion. An empirical throughput expression for parallel flows based on experimental data is presented, and guidelines for the use of parallel flows are discussed. (45 refs.)

Keywords: network congestion, parallel tcp streams, transport protocols, pario-bib

hadimioglu:fs:
Haldun Hadimioglu and Robert J. Flynn. The architectural design of a tightly-coupled distributed hypercube file system. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 147-150, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: hypercube, multiprocessor file system, pario-bib

Comment: An early paper describing a proposed file system for hypercubes. The writing is almost impenetrable. Confusing and not at all clear what they propose. See also hadimioglu:hyperfs and flynn:hyper-fs.

hadimioglu:hyperfs:
Haldun Hadimioglu and Robert J. Flynn. The design and analysis of a tightly coupled hypercube file system. In Proceedings of the Fifth Annual Distributed-Memory Computer Conference, pages 1405-1410, 1990.

Keywords: multiprocessor file system, parallel I/O, hypercube, pario-bib

Comment: Describes a hypercube file system based on I/O nodes and processor nodes. A few results from a hypercube simulator. See hadimioglu:fs and flynn:hyper-fs.

hammond:atmosphere:
Steven W. Hammond, Richard D. Loft, John M. Dennis, and Richard K. Sato. Implementation and performance issues of a massively parallel atmospheric model. Parallel Computing, 21:1593-1619, 1995.

Keywords: parallel computing, scientific computing, weather prediction, global climate model, parallel I/O, pario-bib

Comment: They discuss a weather code that runs on the CM-5. The code writes a history file, dumping some data every timestep, and periodically a restart file. They found that CM-5 Fortran met their needs, although it required huge buffers to get much scalability. They want to see a single, shared file-system image from all processors, have the file format be independent of processor count, use a portable conventional interface, and have throughput scale with the number of computation processors. See also hack:ncar and jones:skyhi in the same issue.

harry:vipfs:
Michael Harry, Juan Miguel del Rosario, and Alok Choudhary. VIP-FS: A VIrtual, Parallel File System for high performance parallel and distributed computing. In Proceedings of the Ninth International Parallel Processing Symposium, pages 159-164, April 1995. Also appeared in ACM Operating Systems Review 29(3), July 1995 pages 35-48.

Keywords: parallel I/O, parallel file system, heterogeneous, pario-bib

Comment: See delrosario:vipfs-tr for an earlier version. Also appears as NPAC report SCCS-686.

hart:grid:
Leslie Hart, Tom Henderson, and Bernardo Rodriguez. An MPI-based scalable runtime system: I/O support for a grid library, 1995 or earlier.

Abstract: In order to attain portability when using message passing on a distributed memory system, a portable message passing system must be used as well as other portable system support services. MPI[1] addresses the message passing problem. To date, there are no standards for system services and I/O. A library developed at NOAA's Forecast Systems Laboratory (FSL) known as the Nearest Neighbor Tool[2] (NNT) provides a high level portable interface to interprocess communications for finite difference approximation numerical weather prediction (NWP) models. In order to achieve portability, MPI is used to support interprocess communications. The other services are provided by the lower level library developed at NOAA/FSL known as the Scalable Runtime System (SRS). The principal focus of this paper is SRS.

Keywords: parallel I/O, runtime library, pario-bib

Comment: They describe the runtime system that supports the Nearest-Neighbor Tool (NNT), which they use to parallelize weather-prediction codes. This paper gives a vague overview of the I/O support. The interface sounds fairly typical, as does the underlying structure (server processes, cache processes, etc). Sounds like it is in its early stages, but is useful for many applications.

hartman:bzebra:
John H. Hartman and John K. Ousterhout. The Zebra striped network file system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 21, pages 309-329. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version hartman:zebra3.

Keywords: parallel I/O, distributed file system, disk striping, pario-bib

Comment: Part of jin:io-book; reformatted version of hartman:zebra3.

hartman:zebra:
John H. Hartman and John K. Ousterhout. Zebra: A striped network file system. In Proceedings of the USENIX File Systems Workshop, pages 71-78, May 1992.
See also later version hartman:zebra2.

Keywords: disk striping, distributed file system, pario-bib

Comment: Not a parallel file system, but worth comparing to Swift. Certainly, a similar idea could be used in a multiprocessor. Cite hartman:zebra3.

hartman:zebra2:
John H. Hartman and John K. Ousterhout. The Zebra striped network file system. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 29-43, Asheville, NC, 1993. ACM Press.
See also earlier version hartman:zebra.
See also later version hartman:zebra3.

Keywords: file system, disk striping, distributed file system, RAID, log-structured file system, parallel I/O, pario-bib

Comment: Zebra stripes across network servers, but not on a file-by-file basis. Instead they use LFS ideas to stripe a per-client log across all file servers. Each client can then compute a parity block for each stripe that it writes. They store ``deltas'' (changes in block locations) along with the data, and also send them to the (central) file manager. The file manager and stripe cleaner are the key state managers, keeping track of where blocks are located and of stripe utilizations. Performance numbers are limited to small-scale tests. This paper has more details than hartman:zebra, and performance numbers (but not with real workloads or the stripe cleaner). Some tricky consistency issues.
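
The client-side parity computation is what makes this cheap: because each client writes whole stripes of its own log, parity is a local XOR over the stripe's fragments, and no server ever has to read old data to update it. A sketch of that step (our illustration, not Zebra's code):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Fragment = std::vector<uint8_t>;

    // XOR the N-1 data fragments of a log stripe into one parity fragment.
    // Assumes all fragments are the same size, as in a fixed-size stripe.
    Fragment parity_of(const std::vector<Fragment>& frags) {
        Fragment p(frags.at(0).size(), 0);
        for (const Fragment& f : frags)
            for (std::size_t i = 0; i < p.size(); ++i)
                p[i] ^= f[i];
        return p;
    }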

hartman:zebra3:
John H. Hartman and John K. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274-310, August 1995.
See also earlier version hartman:zebra2.
See also later version hartman:bzebra.

Keywords: parallel I/O, distributed file system, disk striping, pario-bib

hatcher:linda:
Philip J. Hatcher and Michael J. Quinn. C*-Linda: A programming environment with multiple data-parallel modules and parallel I/O. In Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences, pages 382-389, 1991.

Keywords: parallel I/O, Linda, data parallel, nCUBE, parallel graphics, heterogeneous computing, pario-bib

Comment: C*-Linda is basically a combination of C* and C-Linda. The model is that of several SIMD modules interacting in a MIMD fashion through a Linda tuple space. The modules are created using eval, as in Linda. In this case, the compiler statically assigns each eval to a separate subcube on an nCUBE 3200, although they also talk about multiprogramming several modules on a subcube (not supported by VERTEX). They envision having separate modules running on the nCUBE's graphics processors, or having the file system directly talk to the tuple space, to support I/O. They also envision talking to modules elsewhere on a network, e.g., a workstation, through the tuple space. They reject the idea of sharing memory between modules due to the lack of synchrony between modules, and message passing because it is error-prone.

hayes:nCUBE:
John P. Hayes, Trevor N. Mudge, Quentin F. Stout, Stephen Colley, and John Palmer. Architecture of a hypercube supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing, pages 653-660, St. Charles, IL, 1986. IEEE Computer Society Press.

Keywords: hypercube, parallel architecture, nCUBE, pario-bib

Comment: Description of the first nCUBE, the NCUBE/ten. Good historical background about hypercubes. Talks about their design choices. Says a little about the file system - basically just a way of mounting disks on top of each other, within the nCUBE and to other nCUBEs.

hellwagner:pfs:
Hermann Hellwagner. Design considerations for scalable parallel file systems. The Computer Journal, 36(8):741-755, 1993.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: An overview of the issues in designing a parallel file system, along with some early ideas for their own file system. They aim for a general-purpose system, and characterize the workload into three classes: independent, much like a timesharing system; cooperative-agents, like that expected by most current MIMD file systems; and single-agent, for data-parallel programs where a ``master'' process issues single large requests on behalf of many processes. Their design is heavily weighted to the assumption of shared memory, and in particular to a randomized shared memory (like RP3), so they don't worry about locality much. They say little about their interface, although they intend to stick to a Unix interface - and Unix semantics - as much as possible. The file system is essentially represented by a collection of shared data structures and many threads to manipulate those structures.

hemy:gigabit:
Michael Hemy and Peter Steenkiste. Gigabit I/O for distributed-memory machines: Architecture and applications. In Proceedings of Supercomputing '95, San Diego, CA, 1995. IEEE Computer Society Press.

Abstract: Distributed-memory systems have traditionally had great difficulty performing network I/O at rates proportional to their computational power. The problem is that the network interface has to support network I/O for a supercomputer, using computational and memory bandwidth resources similar to those of a workstation. As a result, the network interface becomes a bottleneck. We implemented an architecture for network I/O for the iWarp system with the following two key characteristics: first, application-specific tasks are off-loaded from the network interface to the distributed-memory system, and second, these tasks are performed in close cooperation with the application. The network interface has been used by several applications for over a year. In this paper we describe the network interface software that manages the communication between the iWarp distributed-memory system and the network interface, we validate the main features of our network interface architecture based on application experience, and we discuss how this architecture can be used by other distributed-memory systems.

Keywords: parallel network I/O, pario-bib

Comment: Parallel network I/O on the iWARP. Note proceedings only on CD-ROM and WWW.

henderson:shpio:
Mark Henderson, Bill Nickless, and Rick Stevens. A scalable high-performance I/O system. In Proceedings of the Scalable High-Performance Computing Conference, pages 79-86, 1994.

Keywords: parallel I/O, pario-bib

Comment: The Scalable I/O initiative intends to build a testbed. At Argonne, they have a 128-node SP-1 with a high-speed switch. 96 are compute nodes, 32 are I/O nodes (128 MB RAM, 1 GB local disk, FibreChannel port). FibreChannel connects to an RS/6000 which has 256 MB RAM, two 80 MB/s buses, and a HIPPI interface to a 220 GB RAID (level 1 or 5) and a 6.4 TB tape robot. They run UniTree on all this. They use multiple files to get parallelism. FibreChannel with TCP/IP is the limiting factor. Note that they are focusing more on the external-connectivity issues than on the internal file system.

herbst:bottleneck:
Kris Herbst. Trends in mass storage: vendors seek solutions to growing I/O bottleneck. Supercomputing Review, pages 46-49, March 1991.

Keywords: parallel I/O, disk media, optical disk, holographic storage, trends, tape storage, parallel transfer disk, disk striping, pario-bib

Comment: A good overview of the state of the art in March 1991, including particular numbers and vendor names. They discuss disk media (density, rotation, etc.), parallel transfer disks, disk arrays, parity and RAID, HiPPI, tape archives, optical memory, and holographic storage. Rotation speeds can increase as diameter goes down. Density increases are often offset by slower head-settling times. Disk arrays will hit their ``heyday'' in the 1990s. There is a trend toward network-attached storage devices that don't need a computer as a server.

herland:mpvms:
Bjarne Geir Herland. MPVMS - MasPar virtual memory system. Master's thesis, University of Bergen, Bergen, Norway, July 1992.

Keywords: parallel I/O, virtual memory, SIMD, multiprocessor file system, pario-bib

Comment: He has an MPL (Maspar C) preprocessor that inserts code to allow you to make plural vectors and arrays pageable. The preprocessor inserts checks before every access to see whether you have that data in memory, and if not, to page it in. The preprocessor is supported by a run-time library. No compiler, OS, or hardware mods.
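
The transformation he describes amounts to inserting a residency check before each access. A hypothetical illustration (not MPL or MPVMS code) of what the preprocessor might emit, backed by a runtime that pages data in on demand:

    #include <cstddef>

    struct PageableArray {
        static const std::size_t PAGE = 4096 / sizeof(double);  // doubles per page
        double* pages[1024] = {};        // null until the page is resident

        double* page_in(std::size_t /*p*/) {  // stand-in: the real runtime
            return new double[PAGE]();        // would fetch the page from disk
        }
        // What the preprocessor emits in place of a plain array access:
        double& at(std::size_t i) {
            std::size_t p = i / PAGE;
            if (!pages[p]) pages[p] = page_in(p);  // residency check, fault in
            return pages[p][i % PAGE];
        }
    };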

hersch:pixmap:
Roger D. Hersch. Parallel storage and retrieval of pixmap images. In Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems, pages 221-226, 1993.

Keywords: parallel I/O, file system, pario-bib

Comment: Ways to arrange 2-d images on disk arrays that have multiple processors (like Datamesh), so that retrieval time for images or subimages is minimized.

hey:parkbench:
Tony Hey and David Lancaster. The development of Parkbench and performance prediction. The International Journal of High Performance Computing Applications, 14(3):205-215, Fall 2000.

Keywords: parallel I/O benchmarks, MPI-IO, pario-app, pario-bib

hidrobo:autonomic:
Francisco Hidrobo and Toni Cortes. Towards an autonomic storage system to improve parallel I/O. In Proceedings of the 15th IASTED International Conference on Parallel and Distributed Computing and Systems, volume 1, pages 122-127, Marina del Rey, CA, November 2003. ACTA Press.

Abstract: In this paper, we present a mechanism able to predict the performance a given workload will achieve when running on a given storage device. This mechanism is composed of two modules. The first one models the workload so that its behavior can be reproduced later, without a new execution, even when the storage drives or data placement are modified. The second module is a drive modeler that is able to learn how a storage drive works in an automatic way, just by executing some synthetic tests. Once we have the workload and drive models, we can predict how well that application will perform on the selected storage device or devices, or when the data placement is modified. The results presented in this paper show that this prediction system achieves errors below 10% when compared to the real performance obtained. It is important to notice that the two modules treat both the application and the storage device as black boxes and need no previous information about them. (20 refs.)

Keywords: performance prediction, data placement, storage device modeling, parallel I/O, pario-bib

Comment: Could not find a URL. See for proceedings information.

hillis:cm5:
W. Daniel Hillis and Lewis W. Tucker. The CM-5 connection machine: A scalable supercomputer. Communications of the ACM, 36(11):31-40, November 1993.

Keywords: parallel architecture, SIMD, MIMD, parallel I/O, pario-bib

Comment: A good basic citation for the CM-5 architecture. A little bit about I/O.

hirano:deadlock:
Satoshi Hirano, Masaru Kitsuregawa, and Mikio Takagi. A high performance parallel I/O model and its deadlock prevention/avoidance technique on the super database computer (SDC). In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume I, pages 21-30, 1993.

Keywords: parallel database, concurrency control, deadlock, parallel I/O, pario-bib

Comment: Most interesting to me in this paper is their discussion of the ``container model,'' in which they claim they allow the processors to be driven by the I/O devices. What it boils down to is a producer-consumer queue of containers, each of which contains a task (some tuples and presumably some instruction about what to do with them). The disks put data into containers and stick them on the queue; the processors repeatedly pull containers (tasks) from the queue and process them. They don't describe the activity of the disks in much detail. See kitsuregawa:sdc.
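
In code terms, the container model is just a blocking producer-consumer queue with the disks on the producing side. A sketch of our reading of it (not the SDC implementation):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct Container {
        int task;                  // what to do with these tuples
        std::vector<char> tuples;  // the data read from disk
    };

    class ContainerQueue {
        std::queue<Container> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void put(Container c) {  // called by the disk (producer) side
            { std::lock_guard<std::mutex> g(m_); q_.push(std::move(c)); }
            cv_.notify_one();
        }
        Container get() {        // called by the processors (consumers)
            std::unique_lock<std::mutex> g(m_);
            cv_.wait(g, [this] { return !q_.empty(); });
            Container c = std::move(q_.front());
            q_.pop();
            return c;
        }
    };

Because the processors only ever pull from this queue, computation proceeds at whatever rate the I/O devices can sustain, which is the sense in which the disks ``drive'' the processors.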

ho:reorganization:
T. K. Ho and Jack Y. B. Lee. A row-permutated data reorganization algorithm for growing server-less video-on-demand systems. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 44-51, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: Recently, a new server-less architecture was proposed for building low-cost yet scalable video streaming systems. Compared to conventional client-server-based video streaming systems, this server-less architecture does not need any dedicated video server and yet is highly scalable. Video data are distributed among user hosts and these hosts cooperate to stream video data to one another. Thus as new hosts join the system, they also add streaming and storage capacity to absorb the added streaming load. This study investigates the data reorganization problem when growing a server-less video streaming system. Specifically, as video data are distributed among user hosts, these data need to be redistributed to newly joined hosts to utilize their storage and streaming capacity. This study presents a new data reorganization algorithm that allows a controllable tradeoff between data reorganization overhead and streaming load balance.

Keywords: data reorganization, video on demand, video streaming, pario-bib

holland:decluster:
Mark Holland and Garth Gibson. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 23-35, 1992.
See also later version holland:bdecluster.

Abstract: We describe and evaluate a strategy for declustering the parity encoding in a redundant disk array. This declustered parity organization balances cost against data reliability and performance during failure recovery. It is targeted at highly-available parity-based arrays for use in continuous- operation systems. It improves on standard parity organizations by reducing the additional load on surviving disks during the reconstruction of a failed disk's contents. This yields higher user throughput during recovery, and/or shorter recovery time.

We first address the generalized parity layout problem, basing our solution on balanced incomplete and complete block designs. A software implementation of declustering is then evaluated using a disk array simulator under a highly concurrent workload comprised of small user accesses. We show that declustered parity penalizes user response time while a disk is being repaired (before and during its recovery) less than comparable non-declustered (RAID5) organizations without any penalty to user response time in the fault-free state.

We then show that previously proposed modifications to a simple, single-sweep reconstruction algorithm further decrease user response times during recovery, but, contrary to previous suggestions, the inclusion of these modifications may, for many configurations, also slow the reconstruction process. This result arises from the simple model of disk access performance used in previous work, which did not consider throughput variations due to positioning delays.

Keywords: parity, declustering, striping, disk array, redundancy, reliability, pario-bib

holland:on-line:
Mark Holland, Garth A. Gibson, and Daniel P. Siewiorek. Architectures and algorithms for on-line failure recovery in redundant disk arrays. Journal of Distributed and Parallel Databases, 2(3):295-335, July 1994.

Abstract: The performance of traditional RAID Level 5 arrays is, for many applications, unacceptably poor while one of its constituent disks is non-functional. This paper describes and evaluates mechanisms by which this disk array failure-recovery performance can be improved. The two key issues addressed are the data layout, the mapping by which data and parity blocks are assigned to physical disk blocks in an array, and the reconstruction algorithm, which is the technique used to recover data that is lost when a component disk fails.

The data layout techniques this paper investigates are instantiations of the declustered parity organization, a derivative of RAID Level 5 that allows a system to trade some of its data capacity for improved failure-recovery performance. We show that our instantiations of parity declustering improve the failure-mode performance of an array significantly, and that a parity-declustered architecture is preferable to an equivalent-size multiple-group RAID Level 5 organization in environments where failure-recovery performance is important. The presented analyses also include comparisons to a RAID Level 1 (mirrored disks) approach.

With respect to reconstruction algorithms, this paper describes and briefly evaluates two alternatives, stripe-oriented reconstruction and disk-oriented reconstruction, and establishes that the latter is preferable as it provides faster reconstruction. The paper then revisits a set of previously-proposed reconstruction optimizations, evaluating their efficacy when used in conjunction with the disk-oriented algorithm. The paper concludes with a section on the reliability versus capacity trade-off that must be addressed when designing large arrays.

Keywords: parallel I/O, disk array, RAID, redundancy, reliability, pario-bib

holland:recovery:
Mark Holland, Garth A. Gibson, and Daniel P. Siewiorek. Fast, on-line failure recovery in redundant disk arrays. In Proceedings of the 23rd Annual International Symposium on Fault-Tolerant Computing, pages 421-433, 1993.

Abstract: This paper describes and evaluates two algorithms for performing on-line failure recovery (data reconstruction) in redundant disk arrays. It presents an implementation of disk-oriented reconstruction, a data recovery algorithm that allows the reconstruction process to absorb essentially all the disk bandwidth not consumed by the user processes, and then compares this algorithm to a previously proposed parallel stripe-oriented approach. The disk-oriented approach yields better overall failure-recovery performance.

The paper evaluates performance via detailed simulation on two different disk array architectures: the RAID level 5 organization, and the declustered parity organization. The benefits of the disk-oriented algorithm can be achieved using controller or host buffer memory no larger than the size of three disk tracks per disk in the array. This paper also investigates the tradeoffs involved in selecting the size of the disk accesses used by the failure recovery process.

Keywords: parallel I/O, disk array, RAID, redundancy, reliability, pario-bib

holland:thesis:
Mark Holland. On-Line Data Reconstruction in Redundant Disk Arrays. PhD thesis, Carnegie Mellon University, April 1994.

Abstract: There exists a wide variety of applications in which data availability must be continuous, that is, where the system is never taken off-line and any interruption in the accessibility of stored data causes significant disruption in the service provided by the application. Examples include on-line transaction processing systems such as airline reservation systems, and automated teller networks in banking systems. In addition, there exist many applications for which a high degree of data availability is important, but continuous operation is not required. An example is a research and development environment, where access to a centrally-stored CAD system is often necessary to make progress on a design project. These applications and many others mandate both high performance and high availability from their storage subsystems.

Parity-based redundant disk arrays are very attractive storage alternatives for these systems because they offer both low cost per megabyte and high data reliability. Unfortunately such systems exhibit poor availability characteristics; their performance is severely degraded in the presence of a disk failure. This dissertation addresses the design of parity-based redundant disk arrays that offer dramatically higher levels of performance in the presence of failure than systems comprising the current state of the art.

We consider two primary aspects of the failure-recovery problem: the organization of the data and redundancy in the array, and the algorithm used to recover the lost data. We apply results from combinatorial theory to generate data and parity organizations that minimize performance degradation during failure recovery by evenly distributing all failure-induced workload over a larger-than-minimal collection of disks. We develop a reconstruction algorithm that is able to absorb for failure-recovery essentially all of the array's bandwidth that is not absorbed by the application process(es). Additionally, we develop a design for a redundant disk array targeted at extremely high availability through extremely fast failure recovery. This development also demonstrates the generality of the presented techniques.

Keywords: parallel I/O, disk arrays, RAID, redundancy, reliability, pario-bib

Comment: Garth Gibson, advisor.

hou:disk:
Robert Y. Hou, Gregory R. Ganger, Yale N. Patt, and Charles E. Gimarc. Issues and problems in the I/O subsystem, part I - The magnetic disk. In Proceedings of the Twenty-Fifth Annual Hawaii International Conference on System Sciences, pages 48-57, 1992.

Keywords: parallel I/O, pario-bib

Comment: A short summary of disk I/O issues: disk technology, latency reduction, parallel I/O, etc.

hsiao:decluster:
Hui-I Hsiao and David DeWitt. Chained Declustering: A new availability strategy for multiprocessor database machines. In Proceedings of 6th International Data Engineering Conference, pages 456-465, 1990.

Keywords: disk array, reliability, parallel I/O, pario-bib

Comment: Chained declustering has cost like mirroring, since it replicates each block, but suffers a smaller load increase during failure than mirrors, interleaved declustering, or RAID (or parity striping, my guess). It has reliability between that of mirrors and RAID, and much better than interleaved declustering. It would also be much easier in a distributed environment. See hsiao:diskrep.
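
The placement rule is simple enough to state as code; the sketch below follows the common description of chained declustering (primary copy on one disk, backup on the next disk in the chain), so a failure spreads the read load along the chain instead of doubling it on one mirror partner.

    constexpr int N = 8;  // disks in the chain

    int primary_disk(int fragment) { return fragment % N; }
    int backup_disk(int fragment)  { return (fragment + 1) % N; }

    // When a disk has failed, read a fragment from whichever copy survives.
    int read_from(int fragment, int failed) {
        int p = primary_disk(fragment);
        return (p == failed) ? backup_disk(fragment) : p;
    }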

hsiao:diskrep:
Hui-I Hsiao and David DeWitt. A performance study of three high availability data replication strategies. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 18-28, December 1991.
See also later version hsiao:diskrep2.

Keywords: disk array, reliability, disk mirroring, parallel I/O, pario-bib

Comment: Compares mirrored disks (MD) with interleaved declustering (ID) with chained declustering (CD). ID and CD found to have much better performance in normal and failure modes. See hsiao:decluster.

hsiao:diskrep2:
Hui-I Hsiao and David DeWitt. A performance study of three high availability data replication strategies. Journal of Distributed and Parallel Databases, 1(1):53-79, January 1993.
See also earlier version hsiao:diskrep.

Keywords: disk array, reliability, disk mirroring, parallel I/O, pario-bib

Comment: See hsiao:diskrep.

hsieh:vod:
Jenwei Hsieh, Mengjou Lin, and Thomas M. Ruwart. Performance of a mass storage system for video-on-demand. Journal of Parallel and Distributed Computing, 30(2):147-167, November 1995.

Keywords: multimedia server, video on demand, pario-bib

hu:brapid-cache:
Yiming Hu, Qing Yang, and Tycho Nightingale. RAPID-Cache: a reliable and inexpensive write cache for disk I/O systems. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 15, pages 211-223. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version hu:rapid-cache.

Keywords: parallel I/O, disk cache, disk striping, disk array, pario-bib

Comment: Part of jin:io-book; reformatted version of hu:rapid-cache.

hu:rapid-cache:
Yiming Hu, Qing Yang, and Tycho Nightingale. RAPID-Cache: a reliable and inexpensive write cache for disk I/O systems. In Proceedings of the 5th International Symposium on High Performance Computer Architecture, pages 204-213. IEEE Computer Society Press, January 1999.
See also later version hu:brapid-cache.

Abstract: This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two level hierarchy: a small NVRAM buffer on top of a log disk. The backup cache has nearly equivalent write performance as the primary RAM cache, while the read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods. The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetric cache architecture allows cost-effective designs for very large write caches for high-end disk I/O systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability and cost.

Keywords: parallel I/O, disk cache, disk striping, disk array, pario-bib

hua:annealing:
Kien A. Hua, S. D. Lang, and Wen K. Lee. A decomposition-based simulated annealing technique for data clustering. In Proceedings of the Thirteenth ACM Symposium on Principles of Database Systems, pages 117-128. ACM Press, 1994.

Abstract: It has been demonstrated that simulated annealing provides high-quality results for the data clustering problem. However, existing simulated annealing schemes are memory-based algorithms; they are not suited for solving large problems such as data clustering which typically are too big to fit in the memory space in its entirety. Various buffer replacement policies, assuming either temporal or spatial locality, are not useful in this case since simulated annealing is based on a randomized search process. Poor locality of references will cause the memory to thrash because too many replacements are required. This phenomenon will incur excessive disk accesses and force the machine to run at the speed of the I/O subsystem. In this paper, we formulate the data clustering problem as a graph partition problem (GPP), and propose a decomposition-based approach to address the issue of excessive disk accesses during annealing. We apply the statistical sampling technique to randomly select subgraphs of the GPP into memory for annealing. Both the analytical and experimental studies indicate that the decomposition-based approach can dramatically reduce the costly disk I/O activities while obtaining excellent optimized results.

Keywords: out of core, information retrieval, parallel I/O, pario-bib
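
Example: the decomposition idea above, reduced to a toy two-way graph-partitioning loop: sample a subgraph that fits in memory, anneal it in core, and write the labels back. The cooling schedule, sampling rate, and the load_edges_among interface are illustrative assumptions, not the paper's algorithm.

    import math
    import random

    def anneal_in_core(nodes, edges, labels, temp=2.0, cooling=0.95, steps=500):
        # ordinary simulated annealing, but only on the in-memory subgraph
        cost = sum(labels[u] != labels[v] for u, v in edges)
        for _ in range(steps):
            u = random.choice(nodes)
            labels[u] = 1 - labels[u]                 # flip one node's cluster
            new_cost = sum(labels[a] != labels[b] for a, b in edges)
            if new_cost > cost and random.random() >= math.exp((cost - new_cost) / temp):
                labels[u] = 1 - labels[u]             # reject the uphill move
            else:
                cost = new_cost
            temp *= cooling

    def decomposed_annealing(graph, labels, mem_nodes, rounds):
        for _ in range(rounds):
            nodes = random.sample(list(labels), mem_nodes)  # statistical sampling
            edges = graph.load_edges_among(nodes)           # the only disk access
            anneal_in_core(nodes, edges, labels)
        return labels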

huber:bppfs:
James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal. PPFS: A high performance portable parallel file system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 22, pages 330-343. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version huber:ppfs.

Keywords: parallel file system, parallel I/O, pario-bib

Comment: Part of jin:io-book, revised version of huber:ppfs.

huber:msthesis:
James V. Huber, Jr. PPFS: An experimental file system for high performance parallel input/output. Master's thesis, Department of Computer Science, University of Illinois at Urbana Champaign, February 1995.

Abstract: The I/O problem is described in the context of parallel scientific applications. A user-level input/output library, PPFS, is introduced to address these issues. The design and implementation of PPFS are presented. Some simple performance benchmarks are reported. Experiments on two production-scale applications are given.

Keywords: parallel file system, pario-bib

Comment: He describes the design and implementation of PPFS, along with some experimental results. PPFS is a C++ library and a set of servers that implement a parallel file system on top of unix on a cluster or a Paragon. Interesting features of PPFS include: files are a sequence of records (fixed size or variable size), read_next and read_any operations, a no-extend option to reduce overhead of maintaining file-size information, client and server caching, intermediate caching agents for consistency, prefetching and write behind, and user-defined declustering and indexing policies. User-defined changes actually have to be precompiled into the server programs. Good results in comparison to PFS on the Paragon, though that doesn't say much. They are porting it to the SP-2.
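
Example: a toy stand-in for the record-oriented interface described above; the real PPFS is a C++ library plus servers, so the names read_next and read_any follow the comment but everything else here is an illustrative assumption.

    class RecordFileSketch:
        # a file modeled as a sequence of fixed-size records, per PPFS
        def __init__(self, records):
            self.records = list(records)
            self.cursor = 0          # per-client sequential position
            self.claimed = set()     # records handed out via read_any

        def read_next(self):
            # strict sequential access for this client
            rec = self.records[self.cursor]
            self.cursor += 1
            return rec

        def read_any(self):
            # self-scheduled access: return any record not yet consumed,
            # letting the system pick whatever is cheapest to fetch
            for i, rec in enumerate(self.records):
                if i not in self.claimed:
                    self.claimed.add(i)
                    return rec
            return None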

huber:ppfs:
Jay Huber, Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal. PPFS: A high performance portable parallel file system. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 385-394, Barcelona, July 1995. ACM Press.
See also earlier version huber:ppfs-tr.
See also later version huber:bppfs.

Abstract: Rapid increases in processor performance over the past decade have outstripped performance improvements in input/output devices, increasing the importance of input/output performance to overall system performance. Further, experience has shown that the performance of parallel input/output systems is particularly sensitive to data placement and data management policies, making good choices critical. To explore this vast design space, we have developed a user-level library, the Portable Parallel File System (PPFS), which supports rapid experimentation and exploration. The PPFS includes a rich application interface, allowing the application to advertise access patterns, control caching and prefetching, and even control data placement. PPFS is both extensible and portable, making possible a wide range of experiments on a broad variety of platforms and configurations. Our initial experiments, based on simple benchmarks and two application programs, show that tailoring policies to input/output access patterns yields significant performance benefits, often improving performance by nearly an order of magnitude.

Keywords: parallel file system, parallel I/O, pario-bib

huber:ppfs-scenarios:
Jay Huber, Chris Kuszmaul, Tara Madhyastha, and Chris Elford. Scenarios for the portable parallel file system. Technical report, University of Illinois at Urbana-Champaign, November 1993.

Keywords: parallel file system, parallel I/O, pario-bib

Comment: See also elford:ppfs-tr, huber:ppfs.

huber:ppfs-tr:
Jay Huber, Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal. PPFS: A high performance portable parallel file system. Technical Report UIUCDCS-R-95-1903, University of Illinois at Urbana Champaign, January 1995.
See also later version huber:ppfs.

Abstract: Rapid increases in processor performance over the past decade have outstripped performance improvements in input/output devices, increasing the importance of input/output performance to overall system performance. Further, experience has shown that the performance of parallel input/output systems is particularly sensitive to data placement and data management policies, making good choices critical. To explore this vast design space, we have developed a user-level library, the Portable Parallel File System (PPFS), which supports rapid experimentation and exploration. The PPFS includes a rich application interface, allowing the application to advertise access patterns, control caching and prefetching, and even control data placement. PPFS is both extensible and portable, making possible a wide range of experiments on a broad variety of platforms and configurations. Our initial experiments, based on simple benchmarks and two application programs, show that tailoring policies to input/output access patterns yields significant performance benefits, often improving performance by nearly an order of magnitude.

Keywords: parallel file system, pario-bib

Comment: They have built a user-level library that implements a parallel file system on top of a set of vanilla Unix file systems. Their goals include flexibility and portability, so they can use PPFS to explore issues in parallel I/O. They allow the application to have lots of control over data distribution, cache and prefetch policies, etc. They support fixed- and variable-length records. They support client, server, and shared caches. This TR includes syntax and specs for all functions. They include performance for synthetic benchmarks and application codes, compared with Intel Paragon PFS (which is admittedly not a very tough competitor).

hubovskykunz:msthesis:
Rainer Hubovsky and Florian Kunz. Dealing with massive data: from parallel I/O to grid I/O. Master's thesis, Vienna University of Technology, Vienna, Austria, January 2004.

Abstract: Increasing requirements in HPC led to improvements of CPU power, but bandwidth of I/O subsystems does not keep up with the performance of processors any more. This problem is commonly known as the I/O bottleneck. Additionally, new and stimulating data-intensive problems in biology, physics, astronomy, space exploration, and human genome research arise, which bring new high-performance applications dealing with massive data spread over globally distributed storage resources. Therefore research in HPC focuses more on I/O systems: all leading hardware vendors of multiprocessor systems provided powerful concurrent I/O subsystems. Accordingly, researchers focus on the design of appropriate programming tools and models to take advantage of the available hardware resources. Numerous projects about this topic have appeared, from which a large and unmanageable quantity of publications has come. These publications concern themselves to a large extent with very special problems. Because of when they appeared, the few overview papers deal with Parallel I/O or Cluster I/O. Substantial progress has been made in these research areas since then. Grid Computing has emerged as an important new field, distinguished from conventional Distributed Computing by its focus on large-scale resource sharing, innovative applications and, in some cases, high-performance orientation. Over the past five years, research and development efforts within the Grid community have produced protocols, services and tools that address precisely the challenges that arise when we try to build Grids, I/O being an important part of it. Therefore our work gives an overview of I/O in HPC.

Keywords: parallel I/O, cluster I/O, grid I/O, distributed computing, pario-bib

Comment: Like stockinger:dictionary, this master's thesis categorizes and describes a large set of parallel I/O-related projects and applications.

husmann:format:
Harlan Edward Husmann. High-speed format conversion and parallel I/O in numerical programs. Master's thesis, Department of Computer Science, Univ. of Illinois at Urbana-Champaign, January 1984. Available as TR number UIUCDCS-R-84-1152.

Keywords: parallel I/O, I/O, pario-bib

Comment: Does FORTRAN format conversion in software in parallel or in hardware, to obtain good speedups for lots of programs. However, he found that increasing the I/O bandwidth was the most significant change that could be made in the parallel program.

hwang:pvfs-cache:
In-Chul Hwang, Hojoong Kim, Hanjo Jung, Dong-Hwan Kim, Hojin Ghim, Seung-Ryoul Maeng, and Jung-Wan Cho. Design and implementation of the cooperative cache for PVFS. Lecture Notes in Computer Science, 3036:43-50, June 2004.

Abstract: Recently, there have been many efforts to get high performance in cluster computing with inexpensive PCs connected through high-speed networks. Some of them were to provide high bandwidth and parallelism in file service using a distributed file system. Other research on distributed file systems includes the cooperative cache, which reduces servers' load and improves overall performance. The cooperative cache shares file caches among clients so that a client can request a file from another client, rather than from the server, through inter-client message passing. Among distributed file systems, PVFS (Parallel Virtual File System) provides high performance with parallel I/O on Linux and is widely used in cluster computing. However, PVFS doesn't support any file cache facility. This paper describes the design and implementation of the cooperative cache for PVFS (Coopc-PVFS). We show the efficiency of Coopc-PVFS in comparison to the original PVFS. As a result, the response time of Coopc-PVFS is shorter than or similar to that of the original PVFS.

Keywords: PVFS, cooperative cache, pario-bib
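
Example: the cooperative-cache read path sketched generically (check the local cache, then other clients' caches, then the file server); an assumption-laden illustration of the idea in the abstract, not the Coopc-PVFS code.

    def coop_read(block, local_cache, peer_caches, server_read):
        if block in local_cache:              # local hit
            return local_cache[block]
        for peer in peer_caches:              # remote hit via message passing
            if block in peer:
                local_cache[block] = peer[block]
                return peer[block]
        data = server_read(block)             # miss: go to the I/O server
        local_cache[block] = data
        return data

    # e.g.: coop_read(7, {}, [{7: b"..."}], server_read=lambda b: b"from server")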

hwang:raid-x:
Kai Hwang, Hai Jin, and Roy Ho. RAID-x: A new distributed disk array for I/O-centric cluster computing. In Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing, pages 279-287, Pittsburgh, PA, August 2000. IEEE Computer Society Press.
See also later version hwang:braid-x.

Abstract: A new RAID-x (redundant array of inexpensive disks at level x) architecture is presented for distributed I/O processing on a serverless cluster of computers. The RAID-x architecture is based on a new concept of orthogonal striping and mirroring (OSM) across all distributed disks in the cluster. The primary advantages of this OSM approach lie in: (1) a significant improvement in parallel I/O bandwidth, (2) hiding disk mirroring overhead in the background, and (3) greatly enhanced scalability and reliability in cluster computing applications. All claimed advantages are substantiated with benchmark performance results on the Trojans cluster built at USC in 1999. Throughout the paper, we discuss the issues of scalable I/O performance, enhanced system reliability, and striped checkpointing on distributed RAID-x in a serverless cluster environment.

Keywords: parallel I/O, disk array, disk striping, RAID, pario-bib
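
Example: one way to realize the orthogonal striping and mirroring (OSM) layout the abstract describes: each stripe's data blocks are spread across all disks but one, and the mirror images of the whole stripe are clustered on the remaining disk (rotated per stripe), so mirroring becomes one long sequential write that can be hidden in the background. The real RAID-x layout may differ in detail; this is a hedged sketch.

    def osm_placement(stripe, n_disks):
        # returns ([(block_index, data_disk), ...], mirror_disk)
        mirror_disk = stripe % n_disks                  # rotates across stripes
        data_disks = [d for d in range(n_disks) if d != mirror_disk]
        return list(enumerate(data_disks)), mirror_disk

    # e.g. with 4 disks, stripe 0 puts data blocks on disks 1, 2, 3 and all
    # three mirror images contiguously on disk 0; stripe 1 mirrors on disk 1.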

iannizzotto:avda:
G. Iannizzotto, A. Puliafito, S. Riccobene, and L. Vita. AVDA: A disk array system for multimedia services. In Proceedings of the 1995 International Conference on High Performance Computing, pages 160-165, New Delhi, India, December 1995.

Keywords: disk array, multimedia, parallel I/O, pario-bib

Comment: Petri-net model of disk array using Information-Dispersal Algorithm (IDA) to stripe data. Continuous-media workload.

ibm:sp1:
IBM 9076 Scalable POWERparallel 1: General information. IBM brochure GH26-7219-00, February 1993.

Keywords: multiprocessor architecture, parallel I/O, pario-bib

Comment: See also information about Vesta file system, corbett:vesta.

intel:examples:
Concurrent I/O application examples. Intel Corporation Background Information, 1989.

Keywords: file access pattern, parallel I/O, Intel iPSC/2, hypercube, pario-bib

Comment: Lists several examples and the amount and types of data they require, and how much bandwidth. Fluid flow modeling, Molecular modeling, Seismic processing, and Tactical and strategic systems.

intel:ipsc2io:
iPSC/2 I/O facilities. Intel Corporation, 1988. Order number 280120-001.

Keywords: parallel I/O, hypercube, Intel iPSC/2, pario-bib

Comment: Simple overview, not much detail. See intel:ipsc2, pierce:pario, asbury:fortranio. Separate I/O nodes from compute nodes. Each I/O node has a SCSI bus to the disks, and communicates with other nodes in the system via Direct-Connect hypercube routing.

intel:paragon:
Paragon XP/S product overview. Intel Corporation, 1991.

Keywords: parallel architecture, parallel I/O, Intel, pario-bib

Comment: Not a bad glossy. See also esser:paragon.

intelio:
Intel beefs up its iPSC/2 supercomputer's I/O and memory capabilities. Electronics, November 1988.

Keywords: parallel I/O, hypercube, Intel iPSC/2, pario-bib

iopads-book:
Ravi Jain, John Werth, and James C. Browne, editors. Input/Output in Parallel and Distributed Computer Systems, volume 362 of The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1996.

Keywords: parallel I/O, parallel I/O architecture, parallel I/O algorithm, multiprocessor file system, workload characterization, parallel file access pattern, pario-bib

Comment: A book containing papers from IOPADS '94 and IOPADS '95, plus several survey/tutorial papers. See the bib entries with cross-ref to iopads-book.

ipps-io93:
Ravi Jain, John Werth, and J. C. Browne, editors. Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, Newport Beach, CA, April 1993. Some papers also published in Computer Architecture News 21(5), December 1993.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: The entire proceedings is about parallel I/O.

ipps-io94:
Ravi Jain, John Werth, and J. C. Browne, editors. Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, April 1994. Some papers also published in Computer Architecture News 22(4), September 1994.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: The entire proceedings is about parallel I/O.

isaila:arch:
Florin Isaila. An overview of file system architectures, chapter 13, pages 273-289. Lecture Notes in Computer Science. Springer-Verlag, Dagstuhl, Germany, March 2003.

Abstract: We provide an overview of different file system architectures. We show the influence of results from I/O access pattern studies on file system design. We present techniques, algorithms and data structures used in file system implementations. We overview issues related to both local and distributed file systems. We describe distributed file system architectures for different kinds of network connectivity: tightly-connected networks (clusters and supercomputers), loosely-connected networks (computational grids) or disconnected computers (mobile computing). File system architectures for both network-attached and computer-attached storage are reviewed. We show how the parallel file systems address the requirements of I/O bound parallel applications. Different file sharing semantics in distributed and parallel file systems are explored. We also present how efficient metadata management can be realized in journaled file systems.

Keywords: survey, file system architecture, pario-bib

isaila:clusterfile:
Florin Isaila and Walter F. Tichy. Clusterfile: a flexible physical layout parallel file system. Concurrency and Computation: Practice and Experience, 15(7/8):653-679, 2003.

Abstract: This paper presents Clusterfile, a parallel file system that provides parallel file access on a cluster of computers. We introduce a file partitioning model that has been used in the design of Clusterfile. The model uses a data representation that is optimized for multidimensional array partitioning while allowing arbitrary partitions. The paper shows how the file model can be employed for file partitioning into both physical subfiles and logical views. We also present how the conversion between two partitions of the same file is implemented using a general memory redistribution algorithm. We show how we use the algorithm to optimize non-contiguous read and write operations. The experimental results include performance comparisons with the Parallel Virtual File System (PVFS) and an MPI-IO implementation for PVFS.

Keywords: parallel file system, parallel I/O, pario-bib

isaila:integrating:
Florin Isaila, Guido Malpohl, Vlad Olaru, Gabor Szeder, and Walter Tichy. Integrating collective I/O and cooperative caching into the "clusterfile" parallel file system. In Proceedings of the 18th Annual International Conference on Supercomputing, pages 58-67, Saint-Malo, France, July 2004. ACM Press.

Abstract: This paper presents the integration of two collective I/O techniques into the Clusterfile parallel file system: disk-directed I/O and two-phase I/O. We show that global cooperative cache management improves the collective I/O performance. The solution focuses on integrating disk parallelism with other types of parallelism: memory (by buffering and caching on several nodes), network (by parallel I/O scheduling strategies) and processors (by redistributing the I/O related computation over several nodes). The performance results show considerable throughput increases over ROMIO's extended two-phase I/O.

Keywords: disk-directed I/O, two-phase I/O, clusterfile parallel file system, cooperative cache, pario-bib

isaila:viewio:
Florin Isaila and Walter F. Tichy. View I/O: improving the performance of non-contiguous I/O. In IEEE International Conference on Cluster Computing, pages 336-343, Hong Kong, China, December 2003. IEEE Computer Society Press.

Abstract: This paper presents view I/O, a non-contiguous parallel I/O technique. We show that the linear file model may be an unsuitable abstraction for non-contiguous I/O optimizations. Additionally, the poor cooperation between a file system and an I/O library like MPI-IO may drastically affect the performance. View I/O has detailed knowledge about the parallel structure of a file and about the potential access pattern and exploits it in order to improve performance. The access overhead is reduced by using a "declare once, use several times" strategy and by file offset compaction. We compare and contrast view I/O with other non-contiguous I/O methods. Our measurements on a cluster of computers indicate a significant performance improvement over other approaches.

Keywords: non-contiguous I/O, parallel file structure, pario-bib

itoh:pimos:
Fumihide Itoh, Takashi Chikayama, Takeshi Mori, Masaki Sato, Tatsuo Kato, and Tadashi Sato. The design of the PIMOS file system. In Proceedings of the International Conference on Fifth Generation Computer Systems, volume 1, pages 278-285. ICOT, 1992.

Keywords: parallel file system, pario-bib

Comment: File system in the PIMOS operating system for the PIM (Parallel Inference Machine) in the Fifth Generation Computer Systems project in Japan. Paper design, no results yet. Uses disks that are attached directly to the computational processors. Significant in that it does use client caches in a parallel file system. Caches are kept coherent with a centralized directory-based protocol for exclusive-writer, multiple-reader semantics, supporting sequential consistency. Disk management includes logging to survive crashes. Bitmap free list with buddy system to support full, 1/2, and 1/4 blocks. Trick to avoid constant update of on-disk free list. My suspicion is that the cache coherence protocol may be expensive, especially in larger systems.
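
Example: a toy version of the buddy-system block allocator mentioned above, with full, 1/2, and 1/4 blocks expressed as 4, 2, and 1 quarter-units (free lists stand in for the bitmaps; every detail beyond the comment is an assumption).

    class BuddyFreeList:
        # each full block is 4 quarter-units; sizes are 4 (full), 2, and 1
        def __init__(self, n_blocks):
            self.free = {4: [4 * i for i in range(n_blocks)], 2: [], 1: []}

        def alloc(self, size):                 # size in {1, 2, 4}
            s = size
            while s <= 4 and not self.free[s]:
                s *= 2                         # find a larger chunk to split
            if s > 4:
                return None                    # out of space
            off = self.free[s].pop()
            while s > size:                    # split, keeping the buddy free
                s //= 2
                self.free[s].append(off + s)
            return off

        def free_chunk(self, off, size):
            # coalesce with the buddy whenever it is also free
            while size < 4:
                buddy = off ^ size
                if buddy not in self.free[size]:
                    break
                self.free[size].remove(buddy)
                off = min(off, buddy)
                size *= 2
            self.free[size].append(off)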

jadav:evaluation:
Divyesh Jadav, Chutimet Srinilta, Alok Choudhary, and P. Bruce Berra. An evaluation of design tradeoffs in a high performance media-on-demand server. Multimedia Systems, 5(1):53-68, January 1997.

Abstract: One of the key components of a multi-user multimedia-on-demand system is the data server. Digitalization of traditionally analog data such as video and audio, and the feasibility of obtaining network bandwidths above the gigabit-per-second range are two important advances that have made possible the realization, in the near future, of interactive distributed multimedia systems. Secondary-to-main memory I/O technology has not kept pace with advances in networking, main memory and CPU processing power. Consequently, the performance of the server has a direct bearing on the overall performance of such a system.

In this paper we present a high-performance solution to the I/O retrieval problem in a distributed multimedia system. Parallelism of data retrieval is achieved by striping the data across multiple disks. We identify the different components that contribute to media data retrieval delay. The variable delays among these have a great bearing on the server throughput under varying load conditions. We present a buffering scheme to minimize these variations. We have implemented our model on the Intel Paragon parallel computer. The results of component-wise instrumentation of the server operation are presented and analyzed. We present experimental results that demonstrate the efficacy of the buffering scheme. Based on our experiments, a dynamic admission control policy that takes server workload into account is proposed.

Keywords: parallel I/O, I/O scheduling, multimedia, video on demand, pario-bib

Comment: Much more detailed than jadav:media-on-demand. Here they present less survey information, and all the details on their Paragon implementation/simulation. They experiment with many tradeoffs, and propose and evaluate several scheduling and admission-control algorithms.

jadav:ioschedule:
Divyesh Jadav, Chutimet Srinilta, Alok Choudhary, and P. Bruce Berra. Design and evaluation of data access strategies in a high performance multimedia-on-demand server. In Proceedings of the Second IEEE International Conference on Multimedia Computing and Systems, pages 286-291, May 1995.
See also later version jadav:j-ioschedule.

Abstract: One of the key components of a multi user multimedia on demand system is the data server. Digitization of traditionally analog data such as video and audio, and the feasibility of obtaining network bandwidths above the gigabit per second range are two important advances that have made possible the realization, in the near future, of interactive distributed multimedia systems. Secondary-to-main memory I/O technology has not kept pace with advances in networking, main memory and CPU processing power. Consequently, the performance of the server has a direct bearing on the overall performance of such a system. We develop a model for the architecture of a server for such a system. Parallelism of data retrieval is achieved by striping the data across multiple disks. The performance of any server ultimately depends on the data access patterns. Two modifications of the basic retrieval algorithm are presented to exploit data access patterns in order to improve system throughput and response time. A complementary information caching optimization is discussed. Finally, we present performance results of these algorithms on the IBM SP1 and Intel Paragon parallel computers.

Keywords: parallel I/O, multimedia, pario-bib

Comment: Journal version is jadav:j-ioschedule? See also jadav:media-on-demand. [Comments based on a much earlier version.] They propose I/O scheduling algorithms for multimedia file servers. They assume an MIMD architecture with no shared memory and with a disk on every node. One node is essentially a manager for new requests. Another set are interface nodes, each managing the data flow for a few multimedia data streams. The majority are server nodes, responsible just for fetching their data from disk and sending it to the interface nodes. The interface nodes assemble data from the server nodes into a data stream, and send it on out to the client. They describe algorithms for scheduling requests from the interface node to the server node, and for sending data out to the client. They also describe an algorithm for determining whether the system can accept a new request.

jadav:ioschedule2:
Divyesh Jadav, Chutimet Srinilta, and Alok Choudhary. I/O scheduling tradeoffs in a high performance media-on-demand server. In Proceedings of the 1995 International Conference on High Performance Computing, pages 154-159, New Delhi, India, December 1995.
See also later version jadav:j-ioschedule.

Keywords: multimedia, scheduling, parallel I/O, pario-bib

Comment: See also jadav:ioschedule, jadav:j-ioschedule.

jadav:j-ioschedule:
Divyesh Jadav, Chutimet Srinilta, Alok Choudhary, and P. Bruce Berra. Techniques for scheduling I/O in a high performance multimedia-on-demand server. Journal of Parallel and Distributed Computing, pages 190-203, November 1996.
See also earlier version jadav:ioschedule.

Keywords: parallel I/O, I/O scheduling, multimedia, video on demand, pario-bib

Comment: Conference version is jadav:ioschedule; similar abstract. See jadav:media-on-demand.

jadav:media:
D. Jadav, C. Srinilta, and A. Choudhary. Batching and dynamic allocation techniques for increasing the stream capacity of an on-demand media server. Parallel Computing, 23(12):1727-1742, December 1997.

Abstract: A server for an interactive distributed multimedia system may require thousands of gigabytes of storage space and high I/O bandwidth. In order to maximize system utilization, and thus minimize cost, the load must be balanced among the server's disks, interconnection network and scheduler. Many algorithms for maximizing retrieval capacity from the storage system have been proposed. This paper presents techniques for improving server capacity by assigning media requests to the nodes of a server so as to balance the load on the interconnection network and the scheduling nodes. Five policies for dynamic request assignment are developed. An important factor that affects data retrieval in a high-performance continuous media server is the degree of parallelism of data retrieval. The performance of the dynamic policies on an implementation of a server model developed earlier is presented for two values of the degree of parallelism.

Keywords: multimedia, parallel I/O, pario-bib

jadav:media-on-demand:
Divyesh Jadav and Alok Choudhary. Designing and implementing high performance media-on-demand servers. IEEE Parallel and Distributed Technology, 3(2):29-39, Summer 1995.

Abstract: This paper discusses the architectural requirements of a multimedia-on-demand system, with special emphasis on the media server. Although high-performance computers are the best choice for building media-on-demand servers, implementation poses many difficulties. We conclude with a discussion of the open issues regarding the design and implementation of the server.

Keywords: parallel I/O, multimedia, video on demand, pario-bib

Comment: A survey of the issues involved in designing a media-on-demand server (they really focus on temporal data like video and audio). They do have a few results comparing various granularities for disk-requests and network messages, which seem to be from an Intel Paragon implementation, although they do not describe the experimental setup. See jadav:evaluation, jadav:j-ioschedule, jadav:ioschedule.

jain:airdisks:
Ravi Jain and John Werth. Airdisks and airraid: Modeling and scheduling periodic wireless data. Computer Architecture News, 23(4):23-28, September 1995.

Keywords: wireless communication, mobile computing, RAID, parallel I/O, pario-bib

Comment: They discuss the idea of broadcasting a disk's data over the air, so PDAs can 'read' the disk by waiting for the necessary data to come along. Good for read-only or write-rarely disks. They discuss the idea of dividing the air into multiple (frequency or time) tracks and 'striping' data across the tracks for better bandwidth and reliability.
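
Example: the latency arithmetic behind the idea, assuming a fixed broadcast cycle and a client that starts listening at a uniformly random point; striping the broadcast over several frequency/time tracks divides the cycle a client must scan. Figures are illustrative.

    def expected_wait_seconds(n_blocks, secs_per_block, tracks=1):
        cycle = n_blocks * secs_per_block / tracks   # blocks striped over tracks
        return cycle / 2            # average wait for a uniformly random block

    # e.g. 10000 blocks at 1 ms each: a 5.0 s average 'seek' on one track,
    # but 1.25 s when the data is striped over 4 tracks
    print(expected_wait_seconds(10000, 0.001), expected_wait_seconds(10000, 0.001, 4))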

jain:jschedule:
Ravi Jain, Kiran Somalwar, John Werth, and J. C. Browne. Heuristics for scheduling I/O operations. IEEE Transactions on Parallel and Distributed Systems, 8(3):310-320, March 1997.

Abstract: The I/O bottleneck in parallel computer systems has recently begun receiving increasing interest. Most attention has focused on improving the performance of I/O devices using fairly low-level parallelism in techniques such as disk striping and interleaving. Widely applicable solutions, however, will require an integrated approach which addresses the problem at multiple system levels, including applications, systems software, and architecture. We propose that within the context of such an integrated approach, scheduling parallel I/O operations will become increasingly attractive and can potentially provide substantial performance benefits.

We describe a simple I/O scheduling problem and present approximate algorithms for its solution. The costs of using these algorithms in terms of execution time, and the benefits in terms of reduced time to complete a batch of I/O operations, are compared with the situations in which no scheduling is used, and in which an optimal scheduling algorithm is used. The comparison is performed both theoretically and experimentally. We have found that, in exchange for a small execution time overhead, the approximate scheduling algorithms can provide substantial improvements in I/O completion times.

Keywords: network, graph coloring, multiprocessor file system, resource allocation, scheduling, parallel I/O, pario-bib

Comment: See also jain:pario

jain:pario:
Ravi Jain, Kiran Somalwar, John Werth, and J. C. Browne. Scheduling parallel I/O operations in multiple bus systems. Journal of Parallel and Distributed Computing, 16(4):353-362, December 1992.

Keywords: parallel I/O, shared memory, scheduling, pario-bib

Comment: An algorithm to schedule (off-line) a set of transfers between P procs and D disks, such that no proc or disk does more than one request at a time, and no more than K transfers are concurrent (due to channel limits), with integer arbitrary-length transfers that are preemptable (ie segmentable). Much faster than previous algorithms. Problems, IMHO: off-line is only good for batch executions with known needs (ok for big collective I/Os I suppose). All k channels are usable by all proc-disk pairs, may not be realistic. No accommodation for big difference in disk and channel time, ie, disk probably can't do a channel transfer every time unit. Allows transfers in any order, which means disk seeks could be bad. No cost for preemption of a transfer, which could mean more message overhead if more messages are needed to do a given transfer. Assumes all transfers have predictable time. Still, it could be useful in some situations, esp. where order really doesn't matter.
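
Example: a brute-force scheduler for the same problem, to make the constraints concrete: per time unit, each processor and each disk carries at most one transfer, at most K transfers run, and transfers are preemptable, so they can be sliced one unit at a time. The paper's algorithm is much faster; this greedy version only illustrates the model.

    def greedy_schedule(transfers, K):
        # transfers: list of (proc, disk, length) with integer lengths
        pending = [[p, d, n] for p, d, n in transfers]
        timesteps = []
        while any(rem > 0 for _, _, rem in pending):
            busy_procs, busy_disks, step = set(), set(), []
            for t in pending:
                p, d, rem = t
                if (rem > 0 and len(step) < K
                        and p not in busy_procs and d not in busy_disks):
                    step.append((p, d))        # one unit of this transfer
                    busy_procs.add(p)
                    busy_disks.add(d)
                    t[2] -= 1
            timesteps.append(step)
        return timesteps    # schedule length = len(timesteps)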

jain:pario-intro:
Ravi Jain, John Werth, and J. C. Browne. I/O in parallel and distributed systems: An introduction. In Jain et al. [iopads-book], chapter 1, pages 3-30.

Abstract: We sketch the reasons for the I/O bottleneck in parallel and distributed systems, pointing out that it can be viewed as a special case of a general bottleneck that arises at all levels of the memory hierarchy. We argue that because of its severity, the I/O bottleneck deserves systematic attention at all levels of system design. We then present a survey of the issues raised by the I/O bottleneck in five key areas of parallel and distributed systems: applications, algorithms, compilers, operating systems and architecture. Finally, we address some of the trends we observe emerging in new paradigms of parallel and distributed computing: the convergence of networking and I/O, I/O for massively distributed ``global information systems'' such as the World Wide Web, and I/O for mobile computing and wireless communications. These considerations suggest exciting new research directions in I/O for parallel and distributed systems in the years to come.

Keywords: parallel I/O, out-of-core, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

jensen:thesis:
David Wayne Jensen. Disk I/O In High-Performance Computing Systems. PhD thesis, University of Illinois at Urbana-Champaign, 1993.

Keywords: parallel I/O, file access pattern, multiprocessor file system, pario-bib

Comment: He looks at the effect of I/O traffic on memory access in a multistage network, and custom mappings of file data to disks to support non-sequential I/O. He considers both the traditional ``multiuser'' workload and the case where an application accesses a single file in parallel. Assumes a dance-hall shared-memory MIMD base architecture (CEDAR). Disks are attached either to the memory or processor side of the network, and in either case require four network traversals per read/write operation. Nice summary of previous parallel I/O architectures, and characterization of the workload. Main conclusions: the network is not an inherent bottleneck, but I/O traffic can cause up to 50% loss in memory traffic bandwidth, and bursts of I/O can saturate the network. For a high I/O request rate (eg, all procs active), spread each request over a small number of disks (eg, one), whereas for a low I/O request rate (eg, one proc active) spread each request over lots of disks (eg, all). This avoids cache thrashing when multiple procs hit on one disk node. However, if they are all reading the same data, then there is no cache thrashing and you want to maximize parallelism across disks. When accessing disjoint parts of the same file, it is sometimes better to have one proc do all the accesses, because this avoids out-of-order requests that spoil prefetching, and it avoids contention by multiple procs. No single file-to-disk mapping worked for everything; interleaved (striped) worked well for most sequential patterns, but ``sequential'' (partitioned) mappings worked better for multiple-process loads that tend to focus each process on a disk, eg, an interleaved pattern where the stride is equal to the number of disks. Thus, if your pattern can get you disk locality, use a mapping that will provide it.
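
Example: the two file-to-disk mappings compared in the thesis, as block-to-disk functions for a file of fixed-size blocks on D disks (the partitioned case assumes D divides the number of blocks; purely illustrative).

    def interleaved(block, D):
        # striped: consecutive blocks round-robin across disks;
        # good for globally sequential access
        return block % D

    def partitioned(block, D, n_blocks):
        # 'sequential' mapping: each disk holds one contiguous run; good
        # when each process works on its own contiguous region of the file
        return block // (n_blocks // D)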

jeong:inverted:
Byeong-Soo Jeong and Edward Omiecinski. Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems, 6(2):142-153, April 1995.

Keywords: parallel I/O, pario-bib

Comment: Ways to distribute data across multiple disks to speed information retrieval, given an inverted index. Based on a shared-everything multiprocessor model.

jin:io-book:
Hai Jin, Toni Cortes, and Rajkumar Buyya, editors. High Performance Mass Storage and Parallel I/O: Technologies and Applications. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: RAID, disk array, parallel file system, caching, prefetching, multiprocessor file system, parallel I/O applications, parallel I/O, pario-bib

Comment: An excellent collection of papers that were mostly published earlier.

jin:striping:
H. Jin and K. Hwang. Optimal striping in RAID architecture. Concurrency: Practice and Experience, 12(10):909-916, August 2000.

Abstract: To access a RAID (redundant arrays of inexpensive disks), the disk stripe size greatly affects the performance of the disk array. In this article, we present a performance model to analyze the effects of striping with different stripe sizes in a RAID. The model can be applied to optimize the stripe size. Compared with previous approaches, our model is simpler to apply and more accurately reveals the real performance. Both system designers and users can apply the model to support parallel I/O events.

Keywords: parallel I/O, RAID, disk striping, pario-bib

johnson:insertions:
Theodore Johnson. Supporting insertions and deletions in striped parallel filesystems. In Proceedings of the Seventh International Parallel Processing Symposium, pages 425-433, Newport Beach, CA, 1993. IEEE Computer Society Press.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: If you insert blocks into a striped file, you mess up the nice striping. So he breaks the file into striped extents, and keeps track of the extents with a distributed B-tree index. Deletions also fit into the same scheme.
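
Example: the extent scheme in miniature: the file is a list of striped extents, a sorted index (standing in for the paper's distributed B-tree) maps a logical block to its extent, and the block is placed round-robin within the extent; insertion would split an extent and add an index entry. All representation details are assumptions.

    import bisect

    class StripedExtentMap:
        def __init__(self, n_blocks, width):
            # each extent: (first_logical_block, length, base_disk_block, width)
            self.extents = [(0, n_blocks, 0, width)]
            self.starts = [0]          # sorted keys, B-tree stand-in

        def lookup(self, block):
            i = bisect.bisect_right(self.starts, block) - 1
            first, length, base, width = self.extents[i]
            off = block - first
            return off % width, base + off // width   # (disk, block on disk)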

johnson:scx:
Steve Johnson and Steve Scott. A supercomputer system interconnect and scalable IOS. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 357-367. IEEE Computer Society Press, September 1995.

Abstract: The evolution of system architectures and system configurations has created the need for a new supercomputer system interconnect. Attributes required of the new interconnect include commonality among system and subsystem types, scalability, low latency, high bandwidth, a high level of resiliency, and flexibility. Cray Research Inc. is developing a new system channel to meet these interconnect requirements in future systems. The channel has a ring-based architecture, but can also function as a point-to-point link. It integrates control and data on a single, physical path while providing low latency and variance for control messages. Extensive features for client isolation, diagnostic capabilities, and fault tolerance have been incorporated into the design. The attributes and features of this channel are discussed along with implementation and protocol specifics.

Keywords: mass storage, I/O architecture, I/O interconnect, supercomputer, parallel I/O, pario-bib

Comment: About the Cray Research SCX channel, capable of 1200 MB/s peak and 900 MB/s delivered throughput.

johnson:wave:
Olin G. Johnson. Three-dimensional wave equation computations on vector computers. Proceedings of the IEEE, 72(1):90-95, January 1984.

Keywords: computational physics, parallel I/O, pario-bib

Comment: Old paper on the need for large memory and fast paging and I/O in out-of-core solutions to 3-d seismic modeling. They used 4-way parallel I/O to support their job. Needed to transfer a 3-d matrix in and out of memory by rows, columns, and vertical columns. Stored in block-structured form to improve locality on the disk.
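
Example: the offset arithmetic for a block-structured layout of an n x n x n array stored as contiguous b x b x b sub-blocks; any row, column, or vertical column then touches only n/b large chunks instead of up to n^2 scattered elements. Assumes b divides n; purely illustrative.

    def blocked_offset(i, j, k, n, b):
        nb = n // b                                    # blocks per side
        block = ((i // b) * nb + j // b) * nb + k // b
        within = ((i % b) * b + j % b) * b + k % b
        return block * b * b * b + within              # element offset in file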

jones:mpi-io:
Terry Jones, Richard Mark, Jeanne Martin, John May, Elsie Pierce, and Linda Stanberry. An MPI-IO interface to HPSS. In Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies, pages I:37-50, September 1996.

Keywords: mass storage, parallel I/O, multiprocessor file system interface, pario-bib

jones:skyhi:
Philip W. Jones, Christopher L. Kerr, and Richard S. Hemler. Practical considerations in development of a parallel SKYHI general circulation model. Parallel Computing, 21:1677-1694, 1995.

Keywords: parallel computing, scientific computing, weather prediction, global climate model, parallel I/O, pario-bib

Comment: They talk about a weather code. There's a bit about the parallel I/O issues. They periodically write a restart file, and they write out several types of data files. They write out the data in any order, with a little mini-header in each chunk that describes the chunk. I/O was not a significant percentage of their run time on either the CM5 or C90. See hammond:atmosphere and hack:ncar in the same issue.

kallahalla:buffer-management:
M. Kallahalla and P. J. Varman. Analysis of simple randomized buffer management for parallel I/O. Information Processing Letters, 90(1):47-52, April 2004.

Abstract: Buffer management for a D-disk parallel I/O system is considered in the context of randomized placement of data on the disks. A simple prefetching and caching algorithm PHASE-LRU using bounded lookahead is described and analyzed. It is shown that PHASE-LRU performs an expected number of I/Os that is within a factor $\Theta(\log D/\log\log D)$ of the number performed by an optimal off-line algorithm. In contrast, any deterministic buffer management algorithm with the same amount of lookahead must do at least $\Omega(\sqrt{D})$ times the number of I/Os of the optimal.

Keywords: parallel I/O, prefetching, data placement, caching, buffer management, analysis, algorithms, randomization, pario-bib

kallahalla:pc-opt:
M. Kallahalla and P. J. Varman. PC-OPT: Optimal offline prefetching and caching for parallel I/O systems. IEEE Transactions on Computers, 51(11):1333-1344, November 2002.

Abstract: We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for parallel disk scheduling. Traditional buffer management algorithms that minimize the number of block misses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. We show that in the offline case, where a priori knowledge of all the requests is available, PC-OPT performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal in the parallel disk model. In the online case, we study the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of PC-OPT, with global L-block lookahead, is $\Theta(M-L+D)$, when L < M, and $\Theta(MD/L)$, when L > M, where the number of disks is D and buffer size is M.

Keywords: parallel I/O, file prefetching, pario-bib

kallahalla:prefetch:
Mahesh Kallahalla and Peter J. Varman. Optimal prefetching and caching for parallel I/O systems. In Proceedings of the Thirteenth Symposium on Parallel Algorithms and Architectures, pages 219-228. ACM Press, July 2001. To appear.

Abstract: We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for optimal parallel-disk scheduling. Traditional buffer management algorithms that minimize the number of I/O disk accesses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously.

We present a new algorithm Super for parallel-disk I/O scheduling. We show that in the off-line case, where a priori knowledge of all the requests is available, Super performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal. In the on-line case, we study Super in the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of Super, with global L-block lookahead, is $\Theta(M-L+D)$, when L < M, and $\Theta(MD/L)$, when L >= M, where the number of disks is D and buffer size is M.

Keywords: parallel I/O, prefetch, disk cache, pario-bib

kallahalla:read-once:
Mahesh Kallahalla and Peter J. Varman. Optimal read-once parallel disk scheduling. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 68-77, Atlanta, GA, May 1999. ACM Press.

Abstract: We present an optimal algorithm, L-OPT, for prefetching and I/O scheduling in parallel I/O systems using a read-once model of block reference. The algorithm uses knowledge of the next L block references, L-block lookahead, to schedule I/Os in an on-line manner. It uses a dynamic priority assignment scheme to decide when blocks should be prefetched, so as to minimize the total number of I/Os. The parallel disk model of an I/O system is used to study the performance of L-OPT. We show that L-OPT is comparable to the best on-line algorithm with the same amount of lookahead; the ratio of the length of its schedule to the length of the optimal schedule is within a constant factor of the best possible. Specifically, we show that the competitive ratio of L-OPT is $\Omega(\sqrt{MD/L})$, which matches the lower bound on the competitive ratio of any prefetching algorithm with L-block lookahead. In addition we show that when the lookahead consists of the entire reference string, L-OPT performs the minimum possible number of I/Os; hence L-OPT is the optimal off-line algorithm. Finally, using synthetic traces we empirically study the performance characteristics of L-OPT.

Keywords: disk scheduling, parallel I/O, pario-bib

kalns:video:
Edgar T. Kalns and Yarsun Hsu. Video on demand using the Vesta parallel file system. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 30-46, April 1995.
See also later version kalns:video-book.

Keywords: parallel I/O, multimedia, multiprocessor file system, pario-bib

Comment: Hook a video-display system to the compute node of an SP-1 running Vesta, and then use Vesta file system to serve the video.

kalns:video-book:
Edgar T. Kalns and Yarsun Hsu. Video on demand using the Vesta parallel file system. In Jain et al. [iopads-book], chapter 8, pages 187-204.
See also earlier version kalns:video.

Abstract: Video on Demand (VoD) servers are expected to serve hundreds of customers with as many, or more, movie videos. Such an environment requires large storage capacity and real-time, high-bandwidth transmission capabilities. Massive striping of videos across disk arrays is a viable means to store large amounts of video data and, through parallelism of file access, achieve the needed bandwidth. The Vesta Parallel File System facilitates parallel access from an application to files distributed across a set of I/O processors, each with a set of attached disks. Given Vesta's parallel file access capabilities, this paper examines a number of issues pertaining to the implementation of VoD services on top of Vesta. We develop a prototype VoD experimentation environment on an IBM SP-1 and analyze Vesta's performance in video data retrieval for real-time playback. Specifically, we explore the impact of concurrent video streams competing for I/O node resources, cache effects, and video striping across multiple I/O nodes.

Keywords: parallel I/O, parallel file system, video on demand, multimedia, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

kamalvanshi:pdfs:
Ajay Kamalvanshi, S. K. Ghoshal, and R. C. Hansdah. Design, implementation, and performance evaluation of a parallel distributed file system. In Proceedings of the 1995 International Conference on High Performance Computing, pages 125-129, New Delhi, India, December 1995.

Keywords: parallel file system, parallel I/O, pario-bib

kandaswamy:evaluation:
Meenakshi A. Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental evaluation of I/O optimizations on different applications. IEEE Transactions on Parallel and Distributed Systems, 13(7):728-744, July 2002.

Abstract: Many large-scale applications have significant I/O requirements as well as computational and memory requirements. Unfortunately, the limited number of I/O nodes provided in a typical configuration of the modern message-passing distributed-memory architectures such as the Intel Paragon and the IBM SP-2 limits the I/O performance of these applications severely. In this paper, we examine some software optimization techniques and evaluate their effects in five different I/O-intensive codes from both small and large application domains. Our goals in this study are twofold. First, we want to understand the behavior of large-scale data-intensive applications and the impact of I/O subsystems on their performance and vice versa. Second, and more importantly, we strive to determine the solutions for improving the applications' performance by a mix of software techniques. Our results reveal that different applications can benefit from different optimizations. For example, we found that some applications benefit from file layout optimizations, whereas others take advantage of collective I/O. A combination of architectural and software solutions is normally needed to obtain good I/O performance. For example, we show that with a limited number of I/O resources, it is possible to obtain good performance by using appropriate software optimizations. We also show that beyond a certain level, imbalance in the architecture results in performance degradation even when using optimized software, thereby indicating the necessity of an increase in I/O resources.

Keywords: parallel I/O, parallel application, pario-bib

kandaswamy:hartree:
Meenakshi A. Kandaswamy, Mahmut T. Kandemir, Alok N. Choudhary, and David E. Bernholdt. Optimization and evaluation of Hartree-Fock application's I/O with PASSION. In Proceedings of SC97: High Performance Networking and Computing, San Jose, November 1997. ACM Press.
See also later version kandaswamy:hartree-fock.

Abstract: Parallel machines are an important part of the scientific application developer's tool box and the computational and processing demands placed on these machines are rapidly increasing. Many scientific applications tend to perform high volume data storage, data retrieval and data processing, which demands high performance from the I/O subsystem. In this paper, we conduct an experimental study of the I/O performed by the Hartree-Fock (HF) method, as implemented using a fully distributed data approach in the NWChem parallel computational chemistry package. We use PASSION, a parallel and scalable I/O library, and its optimizations such as prefetching to improve the I/O performance of the HF application, and present extensive experimental results. The effects of both application-related factors and system-related factors on the application's I/O performance are studied in detail. We rank the optimizations based on the significance and impact on the performance of HF's I/O phase as: I. efficient interface to the file system, II. prefetching optimization, and III. buffering. The results show that within the limits of our experimental parameters, application-related factors have a greater effect on the overall I/O behavior of this application. We obtained up to 95% improvement in I/O time and 43% improvement in the overall application performance with these optimizations.

Keywords: parallel I/O, scientific computing, pario-bib

Comment: No page numbers: proceedings on CDROM and web only.

kandaswamy:hartree-fock:
Meenakshi Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental study to analyze and optimize the Hartree-Fock application's I/O with PASSION. The International Journal of High Performance Computing Applications, 12(4):411-439, Winter 1998. In a Special Issue on I/O in Parallel Applications.
See also earlier version kandaswamy:hartree.

Abstract: Many scientific applications tend to perform high volume data storage, data retrieval and data processing, which demands high performance from the I/O subsystem. The focus and contribution of this work is to study the I/O behavior of the Hartree-Fock method using PASSION. HF's I/O phases can contribute up to 62.34% of the total execution time. We reduce the execution time and I/O time up to 54% and 6% respectively of that of the original case through PASSION and its optimizations. Additionally, we categorize the factors that affect the I/O performance of HF into key application-related parameters and key system-related parameters. Based on extensive empirical results and within our experimental space, we order the parameters according to their impact on HF's I/O performance as follows: efficient interface, prefetching, buffering, number of I/O nodes, striping factor and striping unit. We conclude that application-related factors have a more significant effect on HF's I/O performance than the system-related factors within our experimental space.

Keywords: parallel I/O application, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

kandemir:compiler:
Mahmut Kandemir. Compiler-directed collective I/O. IEEE Transactions on Parallel and Distributed Systems, 12(12):1318-1331, December 2001.

Abstract: Current approaches to parallel I/O demand extensive user effort to obtain acceptable performance. This is in part due to difficulties in understanding the characteristics of a wide variety of I/O devices and in part due to inherent complexity of I/O software. While parallel I/O systems provide users with environments where persistent data sets can be shared between parallel processors, the ultimate performance of I/O-intensive codes depends largely on the relation between data access patterns exhibited by parallel processors and storage patterns of data in files and on disks. In cases where access patterns and storage patterns match, we can exploit parallel I/O hardware by allowing each processor to perform independent parallel I/O. In order to keep performance decent under circumstances in which data access patterns and storage patterns do not match, several I/O optimization techniques have been developed in recent years. Collective I/O is such an optimization technique that enables each processor to do I/O on behalf of other processors if doing so improves the overall performance. While it is generally accepted that collective I/O and its variants can bring impressive improvements as far as the I/O performance is concerned, it is difficult for the programmer to use collective I/O in an optimal manner. In this paper, we propose and evaluate a compiler-directed collective I/O approach which detects the opportunities for collective I/O and inserts the necessary I/O calls in the code automatically. An important characteristic of the approach is that instead of applying collective I/O indiscriminately, it uses collective I/O selectively only in cases where independent parallel I/O would not be possible or would lead to an excessive number of I/O calls. The approach involves compiler-directed access pattern and storage pattern detection schemes that work on a multiple application environment. We have implemented the necessary algorithms in a source-to-source translator and within a stand-alone tool. Our experimental results on an SGI/Cray Origin 2000 multiprocessor machine demonstrate that our compiler-directed collective I/O scheme performs very well on different setups built using nine applications from several scientific benchmarks. We have also observed that the I/O performance of our approach is only 5.23 percent worse than an optimal scheme.

Keywords: parallel I/O, collective I/O, compiler, pario-bib

kandemir:io-optimize:
Mahmut Kandemir, Alok Choudhary, and Rajesh Bordawekar. I/O optimizations for compiling out-of-core programs on distributed-memory machines. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, pages 8-9. Society for Industrial and Applied Mathematics, March 1997. To appear. Extended Abstract.

Abstract: Since many large-scale computational problems deal with large quantities of data, optimizing the performance of I/O subsystems of massively parallel machines is an important challenge for system designers. We describe data access reorganization strategies for efficient compilation of out-of-core data-parallel programs on distributed memory machines. Our analytical approach and experimental results indicate that the optimizations introduced in this paper can reduce the amount of time spent in I/O by as much as an order of magnitude on both uniprocessors and multicomputers.

Keywords: parallel I/O, compiler, out-of-core, pario-bib

kandemir:locality:
M. Kandemir, A. Choudhary, J. Ramanujam, and M. Kandaswamy. A unified compiler algorithm for optimizing locality, parallelism, and communication in out-of-core computations. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 79-92, San Jose, CA, November 1997. ACM Press.

Abstract: This paper presents compiler algorithms to optimize out-of-core programs. These algorithms consider loop and data layout transformations in a unified framework. The performance of an out-of-core loop nest containing many references can be improved by a combination of restructuring the loops and file layouts. This approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost tiling loop. Preliminary results from hand-compiles on IBM SP-2 and Intel Paragon show that this approach reduces the execution time, improves the bandwidth speedup and overall speedup. In addition, we extend the base algorithm to work with file layout constraints and show how it can be used for optimizing programs consisting of multiple loop nests.

Keywords: compiler, out of core, parallel I/O, pario-bib

kandemir:ooc:
M. Kandemir, A. Choudhary, J. Ramanujam, and R. Bordawekar. Compilation techniques for out-of-core parallel computations. Parallel Computing, 24(3):597-628, May 1998.

Keywords: parallel I/O, compiler, out of core, pario-bib

kandemir:optimizations:
M. T. Kandemir. Compiler-directed optimizations for improving the performance of I/O-intensive applications. International Journal of Parallel and Distributed Systems and Networks, 5(2):52-65, 2002.

Keywords: parallel I/O, compiler, pario-bib

kandemir:optimize:
Mahmut Kandemir, Alok Choudhary, J. Ramanujam, and Rajesh Bordawekar. Optimizing out-of-core computations in uniprocessors. In Proceedings of the Workshop on Interaction between Compilers and Computer Architectures, pages 1-10. Kluwer Academic Publishers, February 1997.

Abstract: Programs accessing disk-resident arrays generally perform poorly, due to an excessive number of I/O calls and insufficient help from compilers. In this paper, in order to alleviate this problem, we propose a series of compiler optimizations. Both the analytical approach we use and the experimental results provide strong evidence that our method is very effective on uniprocessors for out-of-core nests whose data sizes far exceed the size of available memory.

Keywords: parallel I/O, compiler, out-of-core, pario-bib

kandemir:optimizing:
M. Kandemir, A. Choudhary, J. Ramanujam, and R. Bordawekar. Optimizing out-of-core computations in uniprocessors. Newsletter of the Technical Committee on Computer Architecture (TCCA), pages 25-27, June 1997.

Keywords: out-of-core, parallel I/O, pario-bib

kandemir:reorganize:
Mahmut Kandemir, Rajesh Bordawekar, and Alok Choudhary. Data access reorganizations in compiling out-of-core data parallel programs on distributed memory machines. In Proceedings of the Eleventh International Parallel Processing Symposium, pages 559-564, April 1997.

Abstract: This paper describes optimization techniques for translating out-of-core programs written in a data parallel language to message passing node programs with explicit parallel I/O. We demonstrate that straightforward extension of in-core compilation techniques does not work well for out-of-core programs. We then describe how the compiler can optimize the code by (1) determining appropriate file layouts for out-of-core arrays, (2) permuting the loops in the nest(s) to allow efficient file access, and (3) partitioning the available node memory among references based on I/O cost estimation. Our experimental results indicate that these optimizations can reduce the amount of time spent in I/O by as much as an order of magnitude.

Keywords: compiler, data-parallel, out-of-core, parallel I/O, pario-bib

kandemir:tiling:
Mahmut Kandemir, Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. A unified tiling approach for out-of-core computations. In Sixth Workshop on Compilers for Parallel Computers, pages 323-334, Aachen, Germany, December 1996. Forschungszentrum Jülich GmbH. Also available as Caltech Technical Report CACR 130.

Abstract: This paper describes a framework by which an out-of-core stencil program written in a data-parallel language can be translated into node programs for a distributed-memory message-passing machine with explicit I/O and communication. We focus on a technique called ``Data Space Tiling'' to group data elements into slabs that can fit into the memories of processors. Methods to choose legal tile shapes under several constraints and deadlock-free scheduling of tiles are investigated. Our approach is unified in the sense that it can be applied both to FORALL loops and to loops that involve flow dependences.

Keywords: parallel I/O, compiler, out-of-core, pario-bib
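
Illustration: a minimal C sketch of the slab-at-a-time loop structure that out-of-core compilation produces. The row-major layout, the slab height R, the doubling update, and the use of plain stdio in place of a parallel file system are all assumptions made for exposition.

    #include <stdio.h>
    #include <stdlib.h>

    /* Process an N x N disk-resident array one slab of R rows at a
       time: read a slab, update it in memory, write it back.
       The stream f must be open for update (e.g., "r+b"). */
    void process_out_of_core(FILE *f, long N, long R)
    {
        double *slab = malloc((size_t)(R * N) * sizeof(double));
        for (long r0 = 0; r0 < N; r0 += R) {
            long rows = (r0 + R <= N) ? R : N - r0;
            long off  = r0 * N * (long)sizeof(double);
            fseek(f, off, SEEK_SET);
            fread(slab, sizeof(double), (size_t)(rows * N), f);
            for (long i = 0; i < rows * N; i++)
                slab[i] *= 2.0;              /* stand-in computation */
            fseek(f, off, SEEK_SET);
            fwrite(slab, sizeof(double), (size_t)(rows * N), f);
        }
        free(slab);
    }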

karges:par-pipe:
Jonathan Karges, Otto Ritter, and Sándor Suhai. Design and implementation of a parallel pipe. ACM Operating Systems Review, 31(2):60-94, April 1997.

Keywords: interprocess communication, parallel I/O, pario-bib

Comment: A parallel version of the Unix 'pipe' feature, for connecting the output of one program to multiple other programs or files. Implemented on Solaris. Performance results.

karpovich:bottleneck:
John F. Karpovich, Andrew S. Grimshaw, and James C. French. Breaking the I/O bottleneck at the National Radio Astronomy Observatory (NRAO). Technical Report CS-94-37, University of Virginia, August 1993.

Keywords: scientific database, parallel I/O, pario-bib

Comment: See also karpovich:case-study. That is a subset of this paper.

karpovich:case-study:
John F. Karpovich, James C. French, and Andrew S. Grimshaw. High performance access to radio astronomy data: A case study. In Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management, September 1994. Also available as Univ. of Virginia TR CS-94-25.

Keywords: scientific database, parallel I/O, pario-bib

Comment: Apparently a subset of karpovich:bottleneck. They store a sparse, multidimensional data set (radio astronomy data) as a set of tagged data values, i.e., as a set of tuples, each with several keys and a data value. They use a PLOP format to partition each dimension into slices, so that each intersection of the slices forms a bucket. They decide on the splits based on a preliminary statistical survey of the data. Bucket overflow is handled by chaining. Then, they evaluate the performance of various kinds of queries, i.e., multidimensional range queries. In this workload, queries (reads) are much more common than updates (writes).
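
Illustration: a sketch, with invented names and types, of the bucket mapping such a PLOP-style partitioning implies: each dimension is cut into slices by precomputed boundaries, and a key's bucket is the intersection of its slices, numbered in row-major order.

    #include <stddef.h>

    /* bounds[i] holds the nslices[i]-1 ascending cut points for
       dimension i (derived from the statistical survey of the data).
       The linear slice search is for clarity only. */
    size_t bucket_index(const double *key, int d,
                        const double *const *bounds, const int *nslices)
    {
        size_t idx = 0;
        for (int i = 0; i < d; i++) {
            int s = 0;
            while (s < nslices[i] - 1 && key[i] >= bounds[i][s])
                s++;                         /* slice holding key[i] */
            idx = idx * (size_t)nslices[i] + (size_t)s;
        }
        return idx;
    }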

karpovich:elfs:
John F. Karpovich, Andrew S. Grimshaw, and James C. French. Extensible file systems (ELFS): An object-oriented approach to high performance file I/O. In Proceedings of the Ninth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 191-204, Portland, OR, October 1994. ACM Press.

Keywords: parallel I/O, multiprocessor file system interface, object oriented, pario-bib

Comment: See also grimshaw:elfs, grimshaw:ELFSTR, grimshaw:objects, and karpovich:*. This is also available as UVA TR CS-94-28. This paper focuses more on the object-oriented nature of ELFS than on its ability to support parallel I/O. It also describes two classes they've developed, one for 2d dense matrices and another for range queries on multidimensional sparse data sets. It does have some new performance numbers for ELFS on Intel CFS.

katz:bdiskarch:
Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk system architectures for high performance computing. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 2, pages 15-34. IEEE Computer Society Press and John Wiley & Sons, New York, NY, 2001.
See also earlier version katz:diskarch.

Keywords: parallel I/O, RAID, disk array, disk striping, pario-bib

Comment: Part of jin:io-book; reformatted version of katz:diskarch.

katz:diskarch:
Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk system architectures for high performance computing. Proceedings of the IEEE, 77(12):1842-1858, December 1989.
See also later version katz:bdiskarch.

Keywords: parallel I/O, RAID, disk array, disk striping, pario-bib

Comment: Good review of the background of disks and I/O architectures, but a shorter RAID presentation than patterson:RAID. Also addresses controller structure. Good ref for the I/O crisis background, though they don't use that term here. Good taxonomy of previous array techniques.

katz:io-subsys:
Randy H. Katz, John K. Ousterhout, David A. Patterson, and Michael R. Stonebraker. A project on high performance I/O subsystems. IEEE Database Engineering Bulletin, 11(1):40-47, March 1988.

Keywords: parallel I/O, RAID, Sprite, reliability, disk striping, disk array, pario-bib

Comment: Early RAID project paper. Describes the Berkeley team's plan to use an array of small (100M) hard disks as an I/O server for network file service, transaction processing, and supercomputer I/O. Considering performance, reliability, and flexibility. Initially hooked to their SPUR multiprocessor, using Sprite operating system, new filesystem. Either asynchronous striped or independent operation. Supercomputer I/O is characterized as sequential, minimum latency, low throughput. Use of parity disks to boost reliability. Files may be striped across one or more disks and extend over several sectors, thus a two-dimensional filesystem; striping need not involve all disks.

katz:netfs:
Randy H. Katz. Network-attached storage systems. In Scalable High Performance Computing Conference, pages 68-75, 1992.

Keywords: distributed file system, supercomputer file system, file striping, RAID, parallel I/O, pario-bib

Comment: Comments on the emerging trend of file systems for mainframes and supercomputers that are not attached directly to the computer, but instead to a network attached to the computer. Avoiding data copying seems to be a critical issue in the OS and controllers, for disk and network interfaces. Describes RAID-II prototype.

katz:update:
Randy H. Katz, John K. Ousterhout, David A. Patterson, Peter Chen, Ann Chervenak, Rich Drewes, Garth Gibson, Ed Lee, Ken Lutz, Ethan Miller, and Mendel Rosenblum. A project on high performance I/O subsystems. Computer Architecture News, 17(5):24-31, September 1989.

Keywords: parallel I/O, RAID, reliability, disk array, pario-bib

Comment: A short summary of the RAID project, with some more up-to-date information, such as the completion of the first prototype with 8 SCSI strings and 32 disks.

keane:commercial:
J. A. Keane, T. N. Franklin, A. J. Grant, R. Sumner, and M. Q. Xu. Commercial users' requirements for parallel systems. In Proceedings of the 1993 DAGS/PC Symposium, pages 15-25, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.

Abstract: This paper reports on part of an on-going analysis of parallel systems for commercial users. The particular focus of this paper is on the requirements that commercial users, in particular users with financial database systems, have of parallel systems.

The issues of concern to such users differ from those of concern to science and engineering users. Performance of the parallel system is not the only, or indeed the primary, reason for commercial users to move to such systems. Infrastructure issues are important, such as system availability and interworking with existing systems.

These issues are discussed in the context of a banking customer's requirements. The various technical concerns that these requirements impose are discussed in terms of commercially available systems.

Keywords: parallel architecture, parallel I/O, databases, commercial requirements, pario-bib

kennedy:sio:
Ken Kennedy, Charles Koelbel, and Mike Paleczny. Scalable I/O for out-of-core structures. Technical Report CRPC-TR93357-S, Center for Research on Parallel Computation, Rice University, November 1993. Updated August, 1994.

Keywords: parallel I/O, out-of-core, pario-bib

Comment: They describe a project they are beginning, which attempts to have the compiler analyze a program that uses large arrays, and insert explicit I/O statements to move data between those arrays and disk. This is seen as an alternative to OS and hardware virtual memory, and is likely to provide much better performance (as their initial results show). Their focus is on overlapping I/O and computation.

kermarrec:ha-psls:
Anne-Marie Kermarrec and Christine Morin. HA-PSLS: a highly available parallel single-level store system. Concurrency and Computation: Practice and Experience, 15(10):911-937, August 2003.

Abstract: Parallel single-level store (PSLS) systems integrate a shared virtual memory and a parallel file system. They provide programmers with a global address space including both memory and file data. PSLS systems implemented in a cluster thus represent a natural support for long-running parallel applications, combining both the natural shared memory programming model and a large and efficient file system. However, the need to tolerate failures in such a system increases with the size of applications. We present a highly-available parallel single level store system (HA-PSLS), which smoothly integrates a backward error recovery high-availability mechanism into a PSLS system. Our system is able to tolerate multiple transient failures, a single permanent failure, and power cut failures affecting the whole cluster, without requiring any specialized hardware. For this purpose, HA-PSLS relies on a high degree of integration (and reusability) of high-availability and standard features. A prototype integrating our high-availability support has been implemented and we show some performance results.

Keywords: parallel single level store, high-availability, fault tolerance, checkpointing, replication, integration, parallel file systems, shared virtual memory, pario-bib

khanna:group:
Sanjay Khanna and David Kotz. A split-phase interface for parallel file systems. Technical Report PCS-TR97-312, Dept. of Computer Science, Dartmouth College, March 1997.

Abstract: We describe the effects of a new user-level library for the Galley Parallel File System. This library allows some pre-existing sequential programs to make use of the Galley Parallel File System with minimal modification. It permits programs to efficiently use the parallel file system because the user-level library groups accesses together. We examine the performance of our library, and we show how code needs to be modified to use the library.

Keywords: multiprocessor file system interface, run-time library, parallel file system, parallel I/O, pario-bib, dfk

kim:asynch:
Michelle Y. Kim and Asser N. Tantawi. Asynchronous disk interleaving: Approximating access delays. IEEE Transactions on Computers, 40(7):801-810, July 1991.

Keywords: disk interleaving, parallel I/O, performance modeling, pario-bib

Comment: As opposed to synchronous disk interleaving, where disks are rotationally synchronous and one access is processed at a time. They develop a performance model and validate it with traces of a database system's disk accesses. Average access delay on each disk can be approximated by a normal distribution.
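
Illustration: a rough consequence of this normal approximation (a sketch, not a result quoted from the paper). If a request is declustered over D disks whose per-disk delays X_i are independent and approximately normal with mean mu and variance sigma^2, the request completes only when the slowest disk does, and the classical extreme-value estimate for large D gives

    \mathbb{E}\Bigl[\max_{1 \le i \le D} X_i\Bigr] \;\approx\; \mu + \sigma\sqrt{2\ln D},

so the expected service time of a declustered request grows without bound, though only with the square root of the logarithm of the number of disks.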

kim:fft:
Michelle Y. Kim, Anil Nigam, George Paul, and Robert H. Flynn. Disk interleaving and very large fast Fourier transforms. International Journal of Supercomputer Applications, 1(3):75-96, 1987.

Keywords: parallel I/O, disk striping, scientific computing, algorithm, pario-bib

kim:interleave:
Michelle Y. Kim. Synchronously Interleaved Disk Systems with their Application to the Very Large FFT. PhD thesis, IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, 1986. IBM Report number RC12372.
See also earlier version kim:interleaving.

Keywords: parallel I/O, disk striping, file access pattern, disk array, pario-bib

Comment: Uniprocessor interleaving techniques. Good case for interleaving. Probably better to reference kim:interleaving and kim:fft. Discusses an 3D FFT algorithm in which the matrix is broken into subblocks that are accessed in layers. The layers are stored so this is either contiguous or with a regular stride, in fairly large chunks.

kim:interleaving:
Michelle Y. Kim. Synchronized disk interleaving. IEEE Transactions on Computers, C-35(11):978-988, November 1986.
See also later version kim:interleave.

Keywords: parallel I/O, disk striping, disk array, pario-bib
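
Illustration: the address arithmetic at the heart of round-robin interleaving, as a small C sketch (names invented): logical block b lives on disk b mod D at per-disk offset b div D, and with rotationally synchronized disks the D consecutive blocks of a stripe can be transferred in one parallel access.

    struct placement { int disk; long offset; };

    /* Map logical block b of a striped file onto an array of ndisks. */
    struct placement map_block(long b, int ndisks)
    {
        struct placement p;
        p.disk   = (int)(b % ndisks);    /* which disk in the array */
        p.offset = b / ndisks;           /* stripe number on that disk */
        return p;
    }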

kimbrel:prefetch:
Tracy Kimbrel, Pei Cao, Edward Felten, Anna Karlin, and Kai Li. Integrating parallel prefetching and caching. In Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 262-263, Philadelphia, PA, May 1996. ACM Press. Poster paper.

Keywords: disk prefetching, parallel I/O, pario-bib

Comment: They do a theoretical analysis of prefetching and caching in uniprocessor, single- and multi-disk situations, given that they know the complete access sequence; their measure is not hit rate but rather overall execution time. They found some algorithms that are close to optimal.

kimbrel:prefetch-trace:
Tracy Kimbrel, Andrew Tomkins, R. Hugo Patterson, Brian Bershad, Pei Cao, Edward Felten, Garth Gibson, Anna R. Karlin, and Kai Li. A trace-driven comparison of algorithms for parallel prefetching and caching. In Proceedings of the 1996 Symposium on Operating Systems Design and Implementation, pages 19-34. USENIX Association, October 1996.

Abstract: High-performance I/O systems depend on prefetching and caching in order to deliver good performance to applications. These two techniques have generally been considered in isolation, even though there are significant interactions between them; a block prefetched too early reduces the effectiveness of the cache, while a block cached too long reduces the effectiveness of prefetching. In this paper we study the effects of several combined prefetching and caching strategies for systems with multiple disks. Using disk-accurate trace-driven simulation, we explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints. Some of the strategies have been published with theoretical performance bounds, and some are components of systems that have been built. One is a new algorithm that combines the desirable characteristics of the others. We find that when performance is limited by I/O stalls, aggressive prefetching helps to alleviate the problem; that more conservative prefetching is appropriate when significant I/O stalls are not present; and that a single, simple strategy is capable of doing both.

Keywords: parallel I/O, tracing, prefetch, trace-driven simulation, pario-bib

kitsuregawa:sdc:
Masaru Kitsuregawa, Satoshi Hirano, Masanobu Harada, Minoru Nakamura, and Mikio Takagi. The Super Database Computer (SDC): System architecture, algorithm and preliminary evaluation. In Proceedings of the Twenty-Fifth Annual Hawaii International Conference on System Sciences, volume I, pages 308-319, 1992.

Keywords: parallel database, parallel I/O, pario-bib

Comment: Most interesting to me in this paper is their discussion of the ``container model,'' in which they claim they allow the processors to be driven by the I/O devices. See hirano:deadlock.

klaskey:data-streaming:
Scott Alan Klasky, Stephane Ethier, Zhihong Lin, Kevin Martins, Doug McCune, and Ravi Samtaney. Grid-based parallel data streaming implemented for the gyrokinetic toroidal code. In Proceedings of SC2003: High Performance Networking and Computing, Phoenix, AZ, November 2003. IEEE Computer Society Press.

Abstract: We have developed a threaded parallel data streaming approach using Globus to transfer multi-terabyte simulation data from a remote supercomputer to the scientist's home analysis/visualization cluster, as the simulation executes, with negligible overhead. Data transfer experiments show that this concurrent data transfer approach is more favorable compared with writing to local disk and then transferring this data to be post-processed. The present approach is conducive to using the grid to pipeline the simulation with post-processing and visualization. We have applied this method to the Gyrokinetic Toroidal Code (GTC), a 3-dimensional particle-in-cell code used to study micro-turbulence in magnetic confinement fusion from first principles plasma theory.

Keywords: grid, parallel data streams, hydrodynamics, application, parallel I/O, pario-app, pario-bib

Comment: Published on the web.

klimkowski:solver:
Ken Klimkowski and Robert van de Geijn. Anatomy of an out-of-core dense linear solver. In Proceedings of the 1995 International Conference on Parallel Processing, pages III:29-33, St. Charles, IL, 1995. CRC Press.

Abstract: In this paper, we describe the design and implementation of the Platform Independent Parallel Solver (PIPSolver) package for the out-of-core (OOC) solution of complex dense linear systems. Our approach is unique in that it allows essentially all of RAM to be filled with the current portion of the matrix (slab) to be updated and factored, thereby greatly improving the computation-to-I/O ratio over previous approaches. This work could be viewed in part as an exercise in maximal code reuse: by formulating the OOC LU factorization just right, we managed to reuse essentially all of a very robust and efficient in-core solver, leading directly to a very robust and efficient OOC solver. Experiences and performance are reported for the Cray T3D system.

Keywords: out-of-core algorithm, parallel I/O, pario-bib

kobler:eosdis:
Ben Kobler, John Berbert, Parris Caulk, and P. C. Hariharan. Architecture and design of storage and data management for the NASA Earth Observing System Data and Information System (EOSDIS). In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 65-76. IEEE Computer Society Press, September 1995.

Abstract: Mission to Planet Earth (MTPE) is a long-term NASA research mission to study the processes leading to global climate change. The EOS Data and Information System (EOSDIS) is the component within MTPE that will provide the Earth science community with easy, affordable, and reliable access to Earth science data. EOSDIS is a distributed system, with major facilities at eight Distributed Active Archive Centers (DAACs) located throughout the United States. At the DAACs the Science Data Processing Segment (SDPS) will receive, process, archive, and manage all data. It is estimated that several hundred gigaflops of processing power will be required to process and archive the several terabytes of new data that will be generated and distributed daily. Thousands of science users and perhaps several hundred thousand nonscience users will access the system.

Keywords: mass storage, I/O architecture, parallel I/O, pario-bib

kotz:app-pario:
David Kotz. Applications of parallel I/O. Technical Report PCS-TR96-297, Dept. of Computer Science, Dartmouth College, October 1996. Release 1.
See also later version oldfield:app-pario.

Abstract: Scientific applications are increasingly being implemented on massively parallel supercomputers. Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O. It will be updated periodically.

Keywords: parallel I/O application, file access patterns, dfk, pario-bib

kotz:bdiskdir:
David Kotz. Disk-directed I/O for MIMD multiprocessors. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 35, pages 513-535. IEEE Computer Society Press and John Wiley & Sons, 2001.
Identical to kotz:jdiskdir.

Abstract: Many scientific applications that run on today's multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and enhanced file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, to allow the disk servers to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible both for simple reads and writes and for an out-of-core application. Indeed, our disk-directed I/O technique provided consistent high performance that was largely independent of data distribution, obtained up to 93% of peak disk bandwidth, and was as much as 18 times faster than the traditional technique.

Keywords: parallel I/O, multiprocessor file system, file system caching, dfk, pario-bib

Comment: In jin:io-book, reprinted from kotz:jdiskdir.
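
Illustration: a server-side schematic of the idea in C (invented types, not the paper's simulator code): given the complete list of blocks implied by a collective request, the I/O server sorts it into disk order and streams the transfers, rather than serving client requests in arrival order.

    #include <stdlib.h>

    struct blkreq { long disk_block; int dest_node; };

    static int by_disk_order(const void *a, const void *b)
    {
        long x = ((const struct blkreq *)a)->disk_block;
        long y = ((const struct blkreq *)b)->disk_block;
        return (x > y) - (x < y);
    }

    void serve_collective_read(struct blkreq *r, size_t n)
    {
        qsort(r, n, sizeof *r, by_disk_order);  /* near-sequential schedule */
        for (size_t i = 0; i < n; i++) {
            /* read_disk_block(r[i].disk_block, buf);  -- hypothetical */
            /* send_to_node(r[i].dest_node, buf);      -- hypothetical */
        }
    }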

kotz:bpractical:
David Kotz and Carla Schlatter Ellis. Practical prefetching techniques for multiprocessor file systems. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 17, pages 245-258. IEEE Computer Society Press and John Wiley & Sons, New York, NY, 2001.
Identical to kotz:jpractical.

Abstract: Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies that base decisions only on on-line reference history, and that can be implemented efficiently. We also test the ability of these policies across a range of architectural parameters.

Keywords: dfk, parallel file system, prefetching, disk caching, parallel I/O, MIMD, pario-bib

Comment: Reformatted version of kotz:jpractical. In jin:io-book.

kotz:diskdir:
David Kotz. Disk-directed I/O for MIMD multiprocessors. In Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, pages 61-74. USENIX Association, November 1994. Updated as Dartmouth TR PCS-TR94-226 on November 8, 1994.
See also later version kotz:diskdir-tr.

Abstract: Many scientific applications that run on today's multiprocessors are bottlenecked by their file I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and improved file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, that flips the usual relationship between server and client to allow the disks (actually, disk servers) to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, and close to the maximum disk bandwidth.

Keywords: parallel I/O, multiprocessor file system, file system caching, pario-bib, dfk

Comment: This paper also appeared in Bulletin of the IEEE Technical Committee on Operating Systems and Application Environments, Autumn 1994, pp. 29-42. Also available at http://www.usenix.org/publications/library/proceedings/osdi/kotz.html.

See tech report kotz:diskdir-tr. Please note that the tech report contains newer numbers than those in the OSDI version, although the conclusions have not changed.

kotz:diskdir-tr:
David Kotz. Disk-directed I/O for MIMD multiprocessors. Technical Report PCS-TR94-226, Dept. of Computer Science, Dartmouth College, July 1994. Revised November 8, 1994.
See also earlier version kotz:diskdir.
See also later version kotz:jdiskdir.

Abstract: Many scientific applications that run on today's multiprocessors are bottlenecked by their file I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and improved file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, that flips the usual relationship between server and client to allow the disks (actually, disk servers) to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, and close to the maximum disk bandwidth.

Keywords: parallel I/O, multiprocessor file system, file system caching, dfk, pario-bib

Comment: Short version appeared in OSDI'94. Please note that the revised tech report contains newer numbers than those in the OSDI version, although the conclusions have not changed.

kotz:diskdir2:
David Kotz. Disk-directed I/O for MIMD multiprocessors. Bulletin of the IEEE Technical Committee on Operating Systems and Application Environments, pages 29-42, Autumn 1994.
See also later version kotz:diskdir-tr.

Abstract: Many scientific applications that run on today's multiprocessors are bottlenecked by their file I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and improved file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, that flips the usual relationship between server and client to allow the disks (actually, disk servers) to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, and close to the maximum disk bandwidth.

Keywords: parallel I/O, multiprocessor file system, file system caching, pario-bib, dfk

Comment: Same as kotz:diskdir.

See tech report kotz:diskdir-tr. Please note that the tech report contains newer numbers than those in the OSDI version, although the conclusions have not changed.

kotz:encyc1:
David Kotz and Ravi Jain. I/O in parallel and distributed systems. In Allen Kent and James G. Williams, editors, Encyclopedia of Computer Science and Technology, volume 40, pages 141-154. Marcel Dekker, Inc., 1999. Supplement 25.

Abstract: We sketch the reasons for the I/O bottleneck in parallel and distributed systems, pointing out that it can be viewed as a special case of a general bottleneck that arises at all levels of the memory hierarchy. We argue that because of its severity, the I/O bottleneck deserves systematic attention at all levels of system design. We then present a survey of the issues raised by the I/O bottleneck in six key areas of parallel and distributed systems: applications, algorithms, languages and compilers, run-time libraries, operating systems, and architecture.

Keywords: survey, parallel I/O, pario-bib, dfk

kotz:expand:
David Kotz. Expanding the potential for disk-directed I/O. In Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing, pages 490-495, San Antonio, TX, October 1995. IEEE Computer Society Press.
See also earlier version kotz:expand-tr.

Abstract: As parallel computers are increasingly used to run scientific applications with large data sets, and as processor speeds continue to increase, it becomes more important to provide fast, effective parallel file systems for data storage and for temporary files. In an earlier work we demonstrated that a technique we call disk-directed I/O has the potential to provide consistent high performance for large, collective, structured I/O requests. In this paper we expand on this potential by demonstrating the ability of a disk-directed I/O system to read irregular subsets of data from a file, and to filter and distribute incoming data according to data-dependent functions.

Keywords: parallel I/O, multiprocessor file systems, dfk, pario-bib

kotz:expand-tr:
David Kotz. Expanding the potential for disk-directed I/O. Technical Report PCS-TR95-254, Dept. of Computer Science, Dartmouth College, March 1995.
See also later version kotz:expand.

Abstract: As parallel computers are increasingly used to run scientific applications with large data sets, and as processor speeds continue to increase, it becomes more important to provide fast, effective parallel file systems for data storage and for temporary files. In an earlier work we demonstrated that a technique we call disk-directed I/O has the potential to provide consistent high performance for large, collective, structured I/O requests. In this paper we expand on this potential by demonstrating the ability of a disk-directed I/O system to read irregular subsets of data from a file, and to filter and distribute incoming data according to data-dependent functions.

Keywords: parallel I/O, multiprocessor file systems, dfk, pario-bib

kotz:explore:
David Kotz and Ting Cai. Exploring the use of I/O nodes for computation in a MIMD multiprocessor. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 78-89, April 1995.
See also earlier version kotz:explore-tr.

Abstract: As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost-effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between ``compute'' and ``I/O'' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O.

Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high-performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.

Keywords: parallel I/O, multiprocessor file system, dfk, pario-bib

kotz:explore-tr:
David Kotz and Ting Cai. Exploring the use of I/O nodes for computation in a MIMD multiprocessor. Technical Report PCS-TR94-232, Dept. of Computer Science, Dartmouth College, October 1994. Revised 2/20/95.
See also later version kotz:explore.

Abstract: Most MIMD multiprocessors today are configured with two distinct types of processor nodes: those that have disks attached, which are dedicated to file I/O, and those that do not have disks attached, which are used for running applications. Several architectural trends have led some to propose configuring systems so that all processors are used for application processing, even those with disks attached. We examine this idea experimentally, focusing on the impact of remote I/O requests on local computational processes. We found that in an efficient file system the I/O processors can transfer data at near peak speeds with little CPU overhead, leaving substantial CPU power for running applications. On the other hand, we found that some complex file-system features could require substantial CPU overhead. Thus, for a multiprocessor system to obtain good I/O and computational performance on a mix of applications, the file system (both operating system and libraries) must be prepared to adapt their policies to changing conditions.

Keywords: parallel I/O, multiprocessor file system, dfk, pario-bib

kotz:flexibility:
David Kotz and Nils Nieuwejaar. Flexibility and performance of parallel file systems. ACM Operating Systems Review, 30(2):63-73, April 1996.
See also later version kotz:flexibility2.

Abstract: Many scientific applications for high-performance multiprocessors have tremendous I/O requirements. As a result, the I/O system is often the limiting factor of application performance. Several new parallel file systems have been developed in recent years, each promising better performance for some class of parallel applications. As we gain experience with parallel computing, and parallel file systems in particular, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make application portability a significant problem.

We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (APIs). We think of this approach as the ``RISC'' of parallel file-system design.

We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.

Keywords: parallel I/O, multiprocessor file system, dfk, pario-bib

Comment: A position paper.

kotz:flexibility2:
David Kotz and Nils Nieuwejaar. Flexibility and performance of parallel file systems. In Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC), volume 1127 of Lecture Notes in Computer Science, pages 1-11. Springer-Verlag, September 1996.
See also earlier version kotz:flexibility.

Abstract: As we gain experience with parallel file systems, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk-management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make applications difficult to port.

We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (APIs).

We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.

Keywords: parallel I/O, multiprocessor file system, dfk, pario-bib

Comment: Nearly identical to kotz:flexibility. The only changes are the format, a shorter abstract, and updates to Section 7 and the references.
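
Illustration: the general shape of the strided interface advocated here, as a hypothetical C declaration (not the actual Galley call): a single request describes a whole regular pattern, so the file system sees the full transfer rather than many small reads.

    #include <stddef.h>
    #include <sys/types.h>    /* ssize_t, off_t */

    /* Read nrecords records of record_size bytes each, record k
       starting at file offset file_offset + k * stride.  Returns the
       number of bytes read, or -1 on error.  Signature invented. */
    ssize_t read_strided(int fd, void *buf, size_t record_size,
                         size_t nrecords, off_t file_offset, off_t stride);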

kotz:fsint:
David Kotz. Multiprocessor file system interfaces. Technical Report PCS-TR92-179, Dept. of Math and Computer Science, Dartmouth College, March 1992. Revised version appeared in PDIS'93.
See also later version kotz:fsint2.

Abstract: Increasingly, file systems for multiprocessors are designed with parallel access to multiple disks, to keep I/O from becoming a serious bottleneck for parallel applications. Although file system software can transparently provide high-performance access to parallel disks, a new file system interface is needed to facilitate parallel access to a file from a parallel application. We describe the difficulties faced when using the conventional (Unix-like) interface in parallel applications, and then outline ways to extend the conventional interface to provide convenient access to the file for parallel programs, while retaining the traditional interface for programs that have no need for explicitly parallel file access. Our interface includes a single naming scheme, a multiopen operation, local and global file pointers, mapped file pointers, logical records, multifiles, and logical coercion for backward compatibility.

Keywords: dfk, parallel I/O, multiprocessor file system, file system interface, pario-bib

Comment: See also lake:pario for implementation of some of the ideas.

kotz:fsint2:
David Kotz. Multiprocessor file system interfaces. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, pages 194-201. IEEE Computer Society Press, 1993.
See also earlier version kotz:fsint.

Abstract: Increasingly, file systems for multiprocessors are designed with parallel access to multiple disks, to keep I/O from becoming a serious bottleneck for parallel applications. Although file system software can transparently provide high-performance access to parallel disks, a new file system interface is needed to facilitate parallel access to a file from a parallel application. We describe the difficulties faced when using the conventional (Unix-like) interface in parallel applications, and then outline ways to extend the conventional interface to provide convenient access to the file for parallel programs, while retaining the traditional interface for programs that have no need for explicitly parallel file access. Our interface includes a single naming scheme, a multiopen operation, local and global file pointers, mapped file pointers, logical records, multifiles, and logical coercion for backward compatibility.

Keywords: dfk, parallel I/O, multiprocessor file system, file system interface, pario-bib

Comment: See also lake:pario for implementation of some of the ideas.
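
Illustration: invented C shapes for two of the abstract's proposals, multiopen and local/global file pointers; the names come from the abstract, but the signatures are guesses for exposition only.

    /* One handle shared by all processes of the parallel job. */
    typedef struct pfile pfile;

    pfile *multiopen(const char *path);               /* collective open */
    long   read_local (pfile *f, void *buf, long n);  /* private file pointer */
    long   read_global(pfile *f, void *buf, long n);  /* shared file pointer,
                                                         advanced atomically */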

kotz:fsint2p:
David Kotz. Multiprocessor file system interfaces. In Proceedings of the USENIX File Systems Workshop, pages 149-150. USENIX Association, May 1992.
See also later version kotz:fsint2.

Keywords: dfk, parallel I/O, multiprocessor file system, file system interface, pario-bib

Comment: Short paper (2 pages).

kotz:int-ddio:
David Kotz. Interfaces for disk-directed I/O. Technical Report PCS-TR95-270, Dept. of Computer Science, Dartmouth College, September 1995.

Abstract: In other papers I propose the idea of disk-directed I/O for multiprocessor file systems. Those papers focus on the performance advantages and capabilities of disk-directed I/O, but say little about the application-programmer's interface or about the interface between the compute processors and I/O processors. In this short note I discuss the requirements for these interfaces, and look at many existing interfaces for parallel file systems. I conclude that many of the existing interfaces could be adapted for use in a disk-directed I/O system.

Keywords: disk-directed I/O, parallel I/O, multiprocessor filesystem interfaces, pario-bib, dfk

Comment: See also kotz:jdiskdir, kotz:expand, and kotz:lu.

kotz-i.bib:
David Kotz. Bibliography about Parallel I/O. Available on the WWW at https://www.cs.dartmouth.edu/pario/bib/, 1994-2000.

Keywords: parallel I/O, multiprocessor file system, dfk, pario-bib

Comment: A bibliography of many references on parallel I/O and multiprocessor file-systems issues. As of the fifth edition, it is available on the WWW in HTML format.

kotz:jdiskdir:
David Kotz. Disk-directed I/O for MIMD multiprocessors. ACM Transactions on Computer Systems, 15(1):41-74, February 1997.
Identical to kotz:bdiskdir.
See also earlier version kotz:diskdir-tr.

Abstract: Many scientific applications that run on today's multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and enhanced file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, to allow the disk servers to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible both for simple reads and writes and for an out-of-core application. Indeed, our disk-directed I/O technique provided consistent high performance that was largely independent of data distribution, obtained up to 93% of peak disk bandwidth, and was as much as 18 times faster than the traditional technique.

Keywords: parallel I/O, multiprocessor file system, file system caching, dfk, pario-bib

Comment: This paper is a substantial revision of the diskdir-tr version: all of the experiments have been re-done, using a better-tuned version of the file systems (see kotz:tuning), and adding two-phase I/O to all comparisons. It also incorporates some of the material from kotz:expand and kotz:int-ddio. Also available at http://www.acm.org/pubs/citations/journals/tocs/1997-15-1/p41-kotz/.

kotz:jpractical:
David Kotz and Carla Schlatter Ellis. Practical prefetching techniques for multiprocessor file systems. Journal of Distributed and Parallel Databases, 1(1):33-51, January 1993.
Identical to kotz:bpractical.
See also earlier version kotz:practical.

Abstract: Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies that base decisions only on on-line reference history, and that can be implemented efficiently. We also test the ability of these policies across a range of architectural parameters.

Keywords: dfk, parallel file system, prefetching, disk caching, parallel I/O, MIMD, pario-bib

Comment: See also kotz:jwriteback, kotz:fsint2, cormen:integrate.

kotz:jworkload:
David Kotz and Nils Nieuwejaar. File-system workload on a scientific multiprocessor. IEEE Parallel and Distributed Technology, 3(1):51-60, Spring 1995.
See also earlier version kotz:workload.
See also later version nieuwejaar:workload-tr.

Keywords: parallel file system, file access pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk

kotz:jwriteback:
David Kotz and Carla Schlatter Ellis. Caching and writeback policies in parallel file systems. Journal of Parallel and Distributed Computing, 17(1-2):140-145, January and February 1993.
See also earlier version kotz:writeback.

Abstract: Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. Such parallel disk systems require parallel file system software to avoid performance-limiting bottlenecks. We discuss cache management techniques that can be used in a parallel file system implementation for multiprocessors with scientific workloads. We examine several writeback policies, and give results of experiments that test their performance.

Keywords: dfk, parallel file system, disk caching, parallel I/O, MIMD, pario-bib

Comment: See kotz:jpractical, kotz:fsint2, cormen:integrate.

kotz:lu:
David Kotz. Disk-directed I/O for an out-of-core computation. In Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing, pages 159-166. IEEE Computer Society Press, August 1995.
See also earlier version kotz:lu-tr.

Abstract: New file systems are critical to obtain good I/O performance on large multiprocessors. Several researchers have suggested the use of collective file-system operations, in which all processes in an application cooperate in each I/O request. Others have suggested that the traditional low-level interface (read, write, seek) be augmented with various higher-level requests (e.g., read matrix). Collective, high-level requests permit a technique called disk-directed I/O to significantly improve performance over traditional file systems and interfaces, at least on simple I/O benchmarks. In this paper, we present the results of experiments with an ``out-of-core'' LU-decomposition program. Although its collective interface was awkward in some places, and forced additional synchronization, disk-directed I/O was able to obtain much better overall performance than the traditional system.

Keywords: parallel I/O, numerical analysis, dfk, pario-bib

kotz:lu-tr:
David Kotz. Disk-directed I/O for an out-of-core computation. Technical Report PCS-TR95-251, Dept. of Computer Science, Dartmouth College, January 1995.
See also later version kotz:lu.

Abstract: New file systems are critical to obtain good I/O performance on large multiprocessors. Several researchers have suggested the use of collective file-system operations, in which all processes in an application cooperate in each I/O request. Others have suggested that the traditional low-level interface (read, write, seek) be augmented with various higher-level requests (e.g., read matrix), allowing the programmer to express a complex transfer in a single (perhaps collective) request. Collective, high-level requests permit techniques like two-phase I/O and disk-directed I/O to significantly improve performance over traditional file systems and interfaces. Neither of these techniques has been tested on anything other than simple benchmarks that read or write matrices. Many applications, however, intersperse computation and I/O to work with data sets that cannot fit in main memory. In this paper, we present the results of experiments with an ``out-of-core'' LU-decomposition program, comparing a traditional interface and file system with a system that has a high-level, collective interface and disk-directed I/O. We found that a collective interface was awkward in some places, and forced additional synchronization. Nonetheless, disk-directed I/O was able to obtain much better performance than the traditional system.

Keywords: parallel I/O, numerical analysis, dfk, pario-bib

kotz:pioarch:
David Kotz. Introduction to multiprocessor I/O architecture. In Ravi Jain, John Werth, and James C. Browne, editors, Input/Output in Parallel and Distributed Computer Systems, volume 362 of The Kluwer International Series in Engineering and Computer Science, chapter 4, pages 97-123. Kluwer Academic Publishers, 1996.

Abstract: The computational performance of multiprocessors continues to improve by leaps and bounds, fueled in part by rapid improvements in processor and interconnection technology. I/O performance thus becomes ever more critical, to avoid becoming the bottleneck of system performance. In this paper we provide an introduction to I/O architectural issues in multiprocessors, with a focus on disk subsystems. While we discuss examples from actual architectures and provide pointers to interesting research in the literature, we do not attempt to provide a comprehensive survey. We concentrate on a study of the architectural design issues, and the effects of different design alternatives.

Keywords: parallel I/O, multiprocessor file system, pario-bib, dfk

Comment: Invited paper. Part of a whole book on parallel I/O; see iopads-book.

kotz:practical:
David Kotz and Carla Schlatter Ellis. Practical prefetching techniques for parallel file systems. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 182-189. IEEE Computer Society Press, December 1991.
See also earlier version kotz:thesis.
See also later version kotz:jpractical.

Abstract: Parallel disk subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies, and show that prefetching can be implemented efficiently even for the more complex parallel file access patterns. We test these policies across a range of architectural parameters.

Keywords: dfk, parallel file system, prefetching, disk caching, parallel I/O, MIMD, OS93W extra, OS92W, pario-bib

Comment: Short form of primary thesis results. See kotz:jwriteback, kotz:fsint2, cormen:integrate.

kotz:prefetch:
David F. Kotz and Carla Schlatter Ellis. Prefetching in file systems for MIMD multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(2):218-230, April 1990.
See also earlier version ellis:prefetch.
See also later version kotz:thesis.

Abstract: The problem of providing file I/O to parallel programs has been largely neglected in the development of multiprocessor systems. There are two essential elements of any file system design intended for a highly parallel environment: parallel I/O and effective caching schemes. This paper concentrates on the second aspect of file system design and specifically, on the question of whether prefetching blocks of the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions.

Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that 1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, 2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O operation, and 3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study).

We explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance, and we identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in this environment.

Keywords: dfk, parallel file system, prefetching, MIMD, disk caching, parallel I/O, pario-bib
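
Illustration: the simplest shape in this family of policies, one-block lookahead, as a C sketch; the four cache and disk primitives are hypothetical declarations, not an actual testbed API.

    extern int  cache_present(long block);
    extern void read_block_sync(long block);
    extern int  extends_sequential_run(long block);
    extern void start_async_read(long block);

    /* On an access to block b: demand-fetch b if absent, and if b
       extends a sequential run, prefetch b+1 asynchronously so the
       I/O overlaps subsequent computation. */
    void access_block(long b)
    {
        if (!cache_present(b))
            read_block_sync(b);
        if (extends_sequential_run(b) && !cache_present(b + 1))
            start_async_read(b + 1);
    }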

kotz:thesis:
David Kotz. Prefetching and Caching Techniques in File Systems for MIMD Multiprocessors. PhD thesis, Duke University, April 1991. Available as technical report CS-1991-016.

Abstract: The increasing speed of the most powerful computers, especially multiprocessors, makes it difficult to provide sufficient I/O bandwidth to keep them running at full speed for the largest problems. Trends show that the difference in the speed of disk hardware and the speed of processors is increasing, with I/O severely limiting the performance of otherwise fast machines. This widening access-time gap is known as the ``I/O bottleneck crisis.'' One solution to the crisis, suggested by many researchers, is to use many disks in parallel to increase the overall bandwidth.

This dissertation studies some of the file system issues needed to get high performance from parallel disk systems, since parallel hardware alone cannot guarantee good performance. The target systems are large MIMD multiprocessors used for scientific applications, with large files spread over multiple disks attached in parallel. The focus is on automatic caching and prefetching techniques. We show that caching and prefetching can transparently provide the power of parallel disk hardware to both sequential and parallel applications using a conventional file system interface. We also propose a new file system interface (compatible with the conventional interface) that could make it easier to use parallel disks effectively.

Our methodology is a mixture of implementation and simulation, using a software testbed that we built to run on a BBN GP1000 multiprocessor. The testbed simulates the disks and fully implements the caching and prefetching policies. Using a synthetic workload as input, we use the testbed in an extensive set of experiments. The results show that prefetching and caching improved the performance of parallel file systems, often dramatically.

Keywords: dfk, parallel file system, prefetching, MIMD, disk caching, parallel I/O, pario-bib

Comment: Published as kotz:prefetch, kotz:jwriteback, kotz:jpractical, kotz:fsint2.

kotz:throughput:
David Kotz. Throughput of existing multiprocessor file systems. Technical Report PCS-TR93-190, Dept. of Math and Computer Science, Dartmouth College, May 1993.

Keywords: parallel I/O, multiprocessor file system, performance, survey, dfk, pario-bib

Comment: A brief note on the reported performance of existing file systems (Intel CFS, nCUBE, CM-2, CM-5, and Cray). Many have disappointingly low absolute throughput, in MB/s.

kotz:tuning:
David Kotz. Tuning STARFISH. Technical Report PCS-TR96-296, Dept. of Computer Science, Dartmouth College, October 1996.

Abstract: STARFISH is a parallel file-system simulator we built for our research into the concept of disk-directed I/O. In this report, we detail steps taken to tune the file systems supported by STARFISH, which include a traditional parallel file system (with caching) and a disk-directed I/O system. In particular, we added support for two-phase I/O, smarter disk scheduling, an increased maximum number of outstanding requests that a compute processor may make to each disk, and gather/scatter block transfer. We also present results of the experiments driving the tuning effort.

Keywords: parallel I/O, multiprocessor file system, dfk, pario-bib

Comment: Reports on some new changes to the STARFISH simulator that implements traditional caching and disk-directed I/O. This is meant mainly as a companion to kotz:jdiskdir. See also kotz:diskdir and kotz:expand.

kotz:workload:
David Kotz and Nils Nieuwejaar. Dynamic file-access characteristics of a production parallel scientific workload. In Proceedings of Supercomputing '94, pages 640-649, Washington, DC, November 1994. IEEE Computer Society Press.
See also earlier version kotz:workload-tr.
See also later version kotz:jworkload.

Abstract: Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data in parallel to hundreds or thousands of processors.

Most successful systems are based on a solid understanding of the characteristics of the expected workload, but until now there have been no comprehensive workload characterizations of multiprocessor file systems. We began the CHARISMA project in an attempt to fill that gap. We instrumented the common node library on the iPSC/860 at NASA Ames to record all file-related activity over a two-week period. Our instrumentation is different from previous efforts in that it collects information about every read and write request and about the mix of jobs running in the machine (rather than from selected applications).

The trace analysis in this paper leads to many recommendations for designers of multiprocessor file systems. First, the file system should support simultaneous access to many different files by many jobs. Second, it should expect to see many small requests, predominantly sequential and regular access patterns (although of a different form than in uniprocessors), little or no concurrent file-sharing between jobs, significant byte- and block-sharing between processes within jobs, and strong interprocess locality. Third, our trace-driven simulations showed that these characteristics led to great success in caching, both at the compute nodes and at the I/O nodes. Finally, we recommend supporting strided I/O requests in the file-system interface, to reduce overhead and allow more performance optimization by the file system.

Keywords: parallel file system, file access pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk

Comment: Also at http://www.acm.org/pubs/citations/proceedings/supercomputing/198354/p640-kotz and http://computer.org/conferen/sc94/kotz.html
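
The final recommendation in the abstract above (strided requests in the file-system interface) is easy to make concrete. A hypothetical sketch in Python, with a name and signature invented for illustration rather than taken from the paper or any real system:

    # Hypothetical strided-read call of the kind the paper recommends: one
    # request describes a whole regular pattern, so the file system can
    # schedule a few large accesses instead of receiving many small ones.
    import os

    def read_strided(fd, offset, record_size, stride, count):
        """Read `count` records of `record_size` bytes, starting at `offset`,
        with successive records `stride` bytes apart."""
        return [os.pread(fd, record_size, offset + i * stride)
                for i in range(count)]

Expressed this way, a single call replaces count separate seek/read pairs, which is exactly the overhead reduction the trace analysis argues for.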

kotz:workload-tr:
David Kotz and Nils Nieuwejaar. Dynamic file-access characteristics of a production parallel scientific workload. Technical Report PCS-TR94-211, Dept. of Math and Computer Science, Dartmouth College, April 1994. Revised May 11, 1994.
See also later version kotz:workload.

Abstract: Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data in parallel to hundreds or thousands of processors.

Most successful systems are based on a solid understanding of the characteristics of the expected workload, but until now there have been no comprehensive workload characterizations of multiprocessor file systems. We began the CHARISMA project in an attempt to fill that gap. We instrumented the common node library on the iPSC/860 at NASA Ames to record all file-related activity over a two-week period. Our instrumentation is different from previous efforts in that it collects information about every read and write request and about the mix of jobs running in the machine (rather than from selected applications).

The trace analysis in this paper leads to many recommendations for designers of multiprocessor file systems. First, the file system should support simultaneous access to many different files by many jobs. Second, it should expect to see many small requests, predominantly sequential and regular access patterns (although of a different form than in uniprocessors), little or no concurrent file-sharing between jobs, significant byte- and block-sharing between processes within jobs, and strong interprocess locality. Third, our trace-driven simulations showed that these characteristics led to great success in caching, both at the compute nodes and at the I/O nodes. Finally, we recommend supporting strided I/O requests in the file-system interface, to reduce overhead and allow more performance optimization by the file system.

Keywords: parallel file system, file access pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk

kotz:writeback:
David Kotz and Carla Schlatter Ellis. Caching and writeback policies in parallel file systems. In IEEE Symposium on Parallel and Distributed Processing, pages 60-67. IEEE Computer Society Press, December 1991.
See also earlier version kotz:thesis.
See also later version kotz:jwriteback.

Abstract: Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. Such parallel disk systems require parallel file system software to avoid performance-limiting bottlenecks. We discuss cache management techniques that can be used in a parallel file system implementation. We examine several writeback policies, and give results of experiments that test their performance.

Keywords: dfk, parallel file system, disk caching, parallel I/O, MIMD, pario-bib

Comment: See also kotz:jpractical, kotz:fsint2, cormen:integrate.

krammer:marmot:
Bettina Krammer, Matthias S. Müller, and Michael M. Resch. MPI I/O analysis and error detection with MARMOT. Lecture Notes in Computer Science, 3241:242-250, September 2004.

Abstract: The most frequently used part of MPI-2 is MPI I/O. Due to the complexity of parallel programming in general, and of handling parallel I/O in particular, there is a need for tools that support the application development process. There are many situations where incorrect usage of MPI by the application programmer can be automatically detected. In this paper we describe the MARMOT tool that uncovers some of these errors and we also analyze to what extent it is possible to do so for MPI I/O.

Keywords: MPI I/O, error detection, performance analysis, MARMOT, pario-bib

krieger:asf:
Orran Krieger, Michael Stumm, and Ronald Unrau. The Alloc Stream Facility: A redesign of application-level stream I/O. IEEE Computer, 27(3):75-82, March 1994.
See also earlier version krieger:asf-tr.

Keywords: memory-mapped file, file system, parallel I/O, pario-bib

krieger:asf-tr:
Orran Krieger, Michael Stumm, and Ronald Unrau. The Alloc Stream Facility: A redesign of application-level stream I/O. Technical Report CSRI-275, Computer Systems Research Institute, University of Toronto, Toronto, Canada, M5S 1A1, October 1992.
See also later version krieger:asf.

Abstract: This paper describes the design and implementation of a new application level I/O facility, called the Alloc Stream Facility. The Alloc Stream Facility has several key advantages. First, performance is substantially improved as a result of a) the structure of the facility that allows it to take advantage of system specific features like mapped files, and b) a reduction in data copying and the number of I/O system calls. Second, the facility is designed for multi-threaded applications running on multiprocessors and allows for a high degree of concurrency. Finally, the facility can support a variety of I/O interfaces, including stdio, emulated Unix I/O, ASI, and C++ streams, in a way that allows applications to freely intermix calls to the different interfaces, resulting in improved code reusability.

We show that on several Unix workstation platforms the performance of Unix applications using the Alloc Stream Facility can be substantially better than when the applications use the original I/O facilities.

Keywords: memory-mapped file, file system, parallel I/O, pario-bib

Comment: See also krieger:mapped. ``This is an extended version of the paper with the same title in the March, 1994 edition of IEEE Computer.'' A 3-level interface structure: interface, backplane, and stream-specific modules. Different interfaces available: unix, stdio, ASI (theirs), C++. Common backplane. Stream-specific implementations that export operations like salloc and sfree, which return pointers to data buffers. ASI exports that interface to the user, for maximum efficiency. Performance is best when using mapped files as underlying implementation. Many stdio or unix apps are faster only after relinking. ASI is even faster. In addition to better performance, also get multithreading support, multiple interfaces, and extensibility.

krieger:hfs:
Orran Krieger and Michael Stumm. HFS: a flexible file system for large-scale multiprocessors. In Proceedings of the 1993 DAGS/PC Symposium, pages 6-14, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.
See also later version krieger:hfs2.

Abstract: The Hurricane File System (HFS) is a new file system being developed for large-scale shared memory multiprocessors with distributed disks. The main goal of this file system is scalability; that is, the file system is designed to handle demands that are expected to grow linearly with the number of processors in the system. To achieve this goal, HFS is designed using a new structuring technique called Hierarchical Clustering. HFS is also designed to be flexible in supporting a variety of policies for managing file data and for managing file system state. This flexibility is necessary to support in a scalable fashion the diverse workloads we expect for a multiprocessor file system.

Keywords: multiprocessor file system, parallel I/O, operating system, shared memory, pario-bib

Comment: This paper is now out of date; see krieger:thesis. Designed for scalability on the hierarchical clustering model (see unrau:cluster), the Hurricane File System targets NUMA shared-memory MIMD machines. Each cluster has its own full file system, which communicates with those in other clusters. Pieces are the name server, open-file server, and block-file server. On first access, the file is mapped into the application space. The VM system calls the BFS to arrange transfers. Open questions: policies for file state management, block distribution, caching, and prefetching. An object-oriented approach is used to allow for flexibility and extensibility. Local disk file systems are log-structured.

krieger:hfs2:
Orran Krieger and Michael Stumm. HFS: A performance-oriented flexible file system based on building-block compositions. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 95-108, Philadelphia, May 1996. ACM Press.
See also earlier version krieger:hfs.
See also later version krieger:hfs3.

Abstract: The Hurricane File System (HFS) is designed for (potentially large-scale) shared memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. For example, a file's structure can be optimized for concurrent random-access write-only operations by ten processes. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application's access pattern. In contrast, most existing parallel file systems support a single file structure and a small set of policies.

We have implemented HFS as part of the Hurricane operating system running on the Hector shared memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.

Keywords: parallel I/O, parallel file system, object-oriented, pario-bib

Comment: A published form of krieger:hfs and the thesis krieger:thesis. Their main point is that the file system is constructed from building-block objects. When you create a file you choose a few building blocks, for example, a replication block that mirrors the file, and some distribution blocks that distribute each replica across a set of disks. When you open the file you plug in some more building blocks, e.g., to do prefetching or to provide the kind of interface that you want to use. They point out that this flexibility is critical to be able to get good performance, because different file-access patterns need different structures and policies. They found that mapped files minimize copying costs and improve performance. They were able to obtain full disk bandwidth. Great paper.
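
The building-block composition described in this comment is easy to mimic with ordinary object composition. A toy analogue with invented classes (not HFS's actual building blocks; the offset arithmetic is deliberately simplified):

    # Toy analogue of HFS-style building-block composition: each block adds
    # one policy and delegates the rest. Classes are ours, purely illustrative.
    class DiskBlock:
        def __init__(self, name):
            self.name = name
        def write(self, off, data):
            print(f"{self.name}: write {len(data)} bytes at offset {off}")

    class Striping:
        """Deal fixed-size stripe units to sub-blocks round-robin."""
        def __init__(self, parts, unit=4096):
            self.parts, self.unit = parts, unit
        def write(self, off, data):
            for i in range(0, len(data), self.unit):
                part = self.parts[((off + i) // self.unit) % len(self.parts)]
                part.write(off + i, data[i:i + self.unit])  # simplified offsets

    class Replication:
        """Mirror every write to all sub-blocks."""
        def __init__(self, replicas):
            self.replicas = replicas
        def write(self, off, data):
            for r in self.replicas:
                r.write(off, data)

    # Composed per file when it is created or opened: a mirrored pair of
    # two-disk stripes; prefetching, locking, etc. would be further wrappers.
    f = Replication([Striping([DiskBlock("d0"), DiskBlock("d1")]),
                     Striping([DiskBlock("d2"), DiskBlock("d3")])])
    f.write(0, b"x" * 10000)

Each block adds one policy and delegates the rest, which is the property the paper argues makes per-file customization cheap.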

krieger:hfs3:
Orran Krieger and Michael Stumm. HFS: A performance-oriented flexible file system based on building-block compositions. ACM Transactions on Computer Systems, 15(3):286-321, August 1997.
See also earlier version krieger:hfs2.

Abstract: The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file's structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application's access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.

Keywords: parallel I/O, parallel file system, object-oriented, pario-bib

krieger:thesis:
Orran Krieger. HFS: A flexible file system for shared-memory multiprocessors. PhD thesis, University of Toronto, October 1994.

Abstract: The Hurricane File System (HFS) is designed for large-scale, shared-memory multiprocessors. Its architecture is based on the principle that a file system must support a wide variety of file structures, file system policies and I/O interfaces to maximize performance for a wide variety of applications. HFS uses a novel, object-oriented building-block approach to provide the flexibility needed to support this variety of file structures, policies, and I/O interfaces. File structures can be defined in HFS that optimize for sequential or random access, read-only, write-only or read/write access, sparse or dense data, large or small file sizes, and different degrees of application concurrency. Policies that can be defined on a per-file or per-open instance basis include locking policies, prefetching policies, compression/decompression policies and file cache management policies. In contrast, most existing file systems have been designed to support a single file structure and a small set of policies.

We have implemented large portions of HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. Also, we show that HFS is able to deliver the full I/O bandwidth of the disks on our system to the applications.

Keywords: parallel I/O, multiprocessor file system, shared memory, memory-mapped I/O, pario-bib

Comment: Excellent work. HFS uses an object-oriented building-block approach to provide flexible, scalable high performance. Indeed, HFS appears to be one of the most flexible parallel file systems available, allowing users to independently control (or redefine) policies for prefetching, caching, redundancy and fault tolerance, and declustering.

krystynak:datavault:
John Krystynak. I/O performance on the Connection Machine DataVault system. Technical Report RND-92-011, NAS Systems Division, NASA Ames, May 1992.
See also later version krystynak:pario.

Keywords: parallel I/O, parallel file system, performance measurement, pario-bib

Comment: Short measurements of the CM-2 DataVault. Faster if you access it through Paris. Can get nearly the full 32 MB/s bandwidth. Has problems using multiple CMIO busses.

krystynak:pario:
John Krystynak and Bill Nitzberg. Performance characteristics of the iPSC/860 and CM-2 I/O systems. In Proceedings of the Seventh International Parallel Processing Symposium, pages 837-841, Newport Beach, CA, 1993. IEEE Computer Society Press.
See also earlier version krystynak:datavault.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Essentially a (short) combination of krystynak:datavault and nitzberg:cfs.

kucera:libc:
Julie Kucera. Making libc suitable for use by parallel programs. In Proceedings of the USENIX Distributed and Multiprocessor Systems Workshop, pages 145-152, 1989.

Keywords: parallel file system interface, pario-bib

Comment: Experience making libc reentrant, adding semaphores, etc., on a Convex. Some problems with I/O. Added semaphores and private memory to make libc calls reentrant, i.e., callable in parallel by multiple threads.

kumar:thesis:
Alok Kumar. SysProView: A framework for visualizing the activities of multiprocessor file systems. Master's thesis, Thayer School of Engineering, Dartmouth College, 1993.

Keywords: parallel I/O, pario-bib

Comment: A visualization tool, now long gone, for display of CHARISMA trace files. See nieuwejaar:workload for details of CHARISMA.

kuo:blackhole:
S. Kuo, M. Winslett, Y. Chen, Y. Cho, M. Subramaniam, and K. Seamons. Application experience with parallel input/output: Panda and the H3expresso black hole simulation on the SP2. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, 1997.

Abstract: The paper summarizes our experiences using the Panda parallel I/O library with the H3expresso numerical relativity code on the Cornell SP2. Two performance issues are described: providing efficient off-loading of output data, and satisfying users' desire to dedicate fewer nodes to I/O. We explore the tradeoffs between potential solutions, and present performance results for our approaches. We also show that Panda's high-level interface, which allows the user to request input or output of a set of arrays with a single command, is a good match for H3expresso's needs.

Keywords: application experience, parallel input/output, parallel I/O, performance issues, multiprocessor file system interface, pario-bib

kuo:efficient:
S. Kuo, M. Winslett, Y. Cho, J. Lee, and Y. Chen. Efficient input and output for scientific simulations. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 33-44, Atlanta, GA, May 1999. ACM Press.

Abstract: Large simulations which run for hundreds of hours on parallel computers often periodically generate snapshots of states, which are later post-processed to visualize the simulated physical phenomenon. For many applications, fast I/O during post-processing, which is dependent on an efficient organization of data on disk, is as important as minimizing computation-time I/O. In this paper we propose optimizations to support efficient parallel I/O for scientific simulations and subsequent visualizations. We present an ordering mechanism to linearize data on disk, a performance model to help to choose a proper stripe unit size, and a scheduling algorithm to minimize communication contention. Our experiments on an IBM SP show that the combination of these strategies provides a 20-25% performance boost.

Keywords: scientific computing, simulation, parallel I/O, pario-bib

kurc:query:
Tahsin Kurc, Chialin Chang, Renato Ferreira, and Alan Sussman. Querying very large multi-dimensional datasets in ADR. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: scientific applications, query-based interface, parallel I/O, pario-bib

Comment: They describe an architecture for accessing data in scientific datasets by performing range queries (a multidimensional bounding box) over the data. This type of access mechanism is useful for applications like satellite imaging.

kwan:cm5io:
Thomas T. Kwan and Daniel A. Reed. Performance of the CM-5 scalable file system. In Proceedings of the 8th ACM International Conference on Supercomputing, pages 156-165, Manchester, UK, July 1994. ACM Press.

Keywords: parallel I/O, parallel architecture, multiprocessor file system, pario-bib

Comment: They measure the performance of the CM-5 Scalable File System using synthetic benchmarks. They compare CM-Fortran with CMMD. The hardware-dependent (``physical'') modes were much faster than the generic-format modes, which have to reorder data between the processor distribution and the disk distribution. The network turned out to be a bottleneck for the performance when reordering was needed. They conclude that more user control over the I/O would be very helpful.

kwan:sort:
Sai Choi Kwan. External Sorting: I/O Analysis and Parallel Processing Techniques. PhD thesis, University of Washington, January 1986. Available as technical report 86-01-01.

Keywords: parallel I/O, sorting, pario-bib

Comment: Examines external sorting techniques such as merge sort, tag sort, multi-pass distribution sort, and one-pass distribution sort. The model is one where I/O complexity is included, assuming a linear seek time distribution and a cost of 1/2 rotation for each seek. Parallel I/O or computing are not considered until the distribution sorts. Architectural model on page 58.
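
The cost model in this comment amounts to a short formula. One hedged reading, with made-up parameter values: seek time is linear in seek distance (so an average-distance seek costs half the maximum), and each access also pays half a rotation of latency plus transfer time.

    # One reading of the comment's cost model; all numbers are illustrative.
    def expected_io_time_ms(max_seek_ms=30.0, rotation_ms=16.7,
                            block_kb=4.0, transfer_kb_per_ms=1.0):
        seek = max_seek_ms / 2       # linear seek: average distance = half max
        latency = rotation_ms / 2    # half a rotation per access
        transfer = block_kb / transfer_kb_per_ms
        return seek + latency + transfer

    print(expected_io_time_ms())     # ~27.4 ms per block access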

kwong:distribution:
Peter Kwong and Shikharesh Majumdar. Study of data distribution strategies for parallel I/O management. In Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC), volume 1127 of Lecture Notes in Computer Science, pages 12-23. Springer-Verlag, September 1996.

Abstract: Recent studies have demonstrated that a significant number of I/O operations are performed by several different classes of parallel applications. However, appropriate I/O management strategies are required to harness the power of parallel I/O. This paper focuses on two I/O management issues that affect system performance in multiprogrammed parallel environments. Characterization of the I/O behavior of parallel applications in terms of four different models is discussed first, followed by an investigation of the performance of a number of different data distribution strategies. Using computer simulations, this research shows that the I/O characteristics of applications and the data distribution have an important effect on system performance. Applications that can do computation and I/O simultaneously, and strategies that incorporate centralized I/O management, are found to be beneficial in a multiprogrammed parallel environment.

Keywords: parallel I/O, pario-bib

Comment: See majumdar:management.

kwong:scheduling:
Peter Kwong and Shikharesh Majumdar. Scheduling of I/O in multiprogrammed parallel systems. Informatica, 23(1):67-76, April 1999.

Abstract: Recent studies have demonstrated that significant I/O is performed by a number of parallel applications. In addition to running these applications on multiple processors, the parallelization of I/O operations and the use of multiple disk drives are required for achieving high system performance. This research is concerned with the effective management of parallel I/O by using appropriate I/O scheduling strategies. Based on a simulation model, the performance of a number of scheduling policies is investigated. Using the I/O characteristics of jobs, such as the total outstanding I/O demand, is observed to be useful in devising effective scheduling strategies.

Keywords: parallel I/O, scheduling, pario-bib

lake:pario:
Brian Lake and Chris Gray. Parallel I/O for MIMD machines. In Proceedings of SS'93: High Performance Computing, pages 301-308, Calgary, June 1993.

Keywords: parallel I/O, MIMD, multiprocessor file system, pario-bib

Comment: They describe the I/O system for the Myrias SPS-3 parallel computer. The SPS is a no-remote-access (NORMA) machine with a software shared-memory abstraction. They provide a standard C/FORTRAN I/O interface, with a few extensions. The user's parallel program is considered a client, and an I/O processor (IOP) is the server. No striping across IOPs, which makes it relatively simple for them to have the server manage the shared file pointer. Their extensions allow atomic file-pointer update, returning the actual position where I/O occurred, and atomic access to fixed- and variable-length records. They have three protocols for different transfer sizes: small, using simple request/response; medium, using a sliding window; and large, using scatter/gather and special hardware double buffering at the IOP. They use scatter/gather DMA and page-table fiddling for messaging. Performance is 89-96% of hardware peak, limited by the IOP's VME backplane.

large-scale-memories:
Special issue on large-scale memories. Algorithmica, 1994.

Keywords: parallel I/O, algorithms, pario-bib

latham:mpi-io-scalability:
Rob Latham, Rob Ross, and Rajeev Thakur. The impact of file systems on MPI-IO scalability. Lecture Notes in Computer Science, 3241:87-96, November 2004.

Abstract: As the number of nodes in cluster systems continues to grow, leveraging scalable algorithms in all aspects of such systems becomes key to maintaining performance. While scalable algorithms have been applied successfully in some areas of parallel I/O, many operations are still performed in an uncoordinated manner. In this work we consider, in three file system scenarios, the possibilities for applying scalable algorithms to the many operations that make up the MPI-IO interface. From this evaluation we extract a set of file system characteristics that aid in developing scalable MPI-IO implementations.

Keywords: scalability analysis, MPI-IO, pario-bib

latham:pvfs2:
Rob Latham, Neil Miller, Robert Ross, and Phil Carns. A next-generation parallel file system for Linux clusters. LinuxWorld, 2(1), January 2004.

Keywords: pvfs2, parallel file system, pario-bib

latifi:network:
S. Latifi, M. Moraes de Azevedo, and N. Bagherzadeh. A star-based I/O-bounded network for massively parallel systems. IEE Proceedings- Computers and Digital Techniques, 142(1):5-14, January 1995.

Abstract: The paper describes a new interconnection network for massively parallel systems, referred to as star-connected cycles (SCC). The SCC graph presents an I/O-bounded structure that results in several advantages over variable degree graphs like the star and the hypercube. The description of the SCC graph includes issues such as labelling of nodes, degree, diameter and symmetry. The paper also presents an optimal routeing algorithm for the SCC and efficient broadcasting algorithms with O(n) running time, with n being the dimensionality of the graph. A comparison with the cube-connected cycles (CCC) and other interconnection networks is included, indicating that, for even n, an n-SCC and a CCC of similar sizes have about the same diameter. In addition, it is shown that one-port broadcasting in an n-SCC graph can be accomplished with a running time better than or equal to that required by an n-star containing (n-1) times fewer nodes.

Keywords: parallel I/O, parallel computer architecture, pario-bib

lauria:server:
Mario Lauria, Keith Bell, and Andrew Chien. A high-performance cluster storage server. In Proceedings of the Eleventh IEEE International Symposium on High Performance Distributed Computing, pages 311-320, Edinburgh, Scotland, 2002. IEEE Computer Society Press.

Keywords: srb, performance-related optimization, pario-bib

Comment: SRB data transfer optimization on cluster storage servers. If the system is disk-bound, transfers from server to disks are broken up so that protocol processing and disk transfer are pipelined. If it is network-bound, transfers are striped from multiple clients to multiple servers. No mention of remote execution.

lautenbach:pfs:
Berin F. Lautenbach and Bradley M. Broom. A parallel file system for the AP1000. In Proceedings of the Third Fujitsu-ANU CAP Workshop, November 1992.

Keywords: distributed file system, multiprocessor file system, pario-bib

Comment: See also broom:acacia, broom:impl, mutisya:cache, and broom:cap. The Acacia file system has file access modes that are much like those in Intel CFS and TMC CMMD. By default all processes have their own file pointer, but they can switch to another mode either all together or in row- or column-subsets. The other modes include a replicated mode (where all read or write the same data), and a variety of shared modes, with arbitrary, fixed, or unspecified ordering among processors, and with fixed or variable-sized records. They also have a parallel-open operation, support for logical records, control over the striping width (number of disks) and height (block size), and control over redundancy. A prototype is running.

lawlor:parity:
F. D. Lawlor. Efficient mass storage parity recovery mechanism. IBM Technical Disclosure Bulletin, 24(2):986-987, July 1981.

Keywords: parallel I/O, disk array, RAID, pario-bib

Comment: An early paper, perhaps the earliest, that describes the techniques that later became RAID. Lawlor notes how to use parity to recover data lost due to disk crash, as in RAID3, addresses the read-before-write problem by caching the old data block as well as the new data block, and shows how two-dimensional parity can protect against two or more failures.
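
The parity mechanism Lawlor describes reduces to XOR arithmetic. A minimal sketch of that standard reconstruction (our code and notation, not Lawlor's):

    # Sketch of XOR-based parity recovery, as later standardized in RAID.
    # Hypothetical data: four "disk" blocks of equal size.
    data_blocks = [bytes([b]) * 4 for b in (0x11, 0x22, 0x33, 0x44)]

    def xor_blocks(blocks):
        """XOR equal-length blocks byte by byte."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    parity = xor_blocks(data_blocks)   # stored on the dedicated parity disk

    # If any one disk fails, XORing the survivors with the parity recovers it.
    lost = 2
    survivors = [b for i, b in enumerate(data_blocks) if i != lost]
    assert xor_blocks(survivors + [parity]) == data_blocks[lost]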

lee:bparity:
Edward K. Lee and Randy H. Katz. The performance of parity placements in disk arrays. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 3, pages 35-54. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version lee:jparity.

Keywords: RAID, disk array, reliability, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of lee:jparity.

lee:bpetal:
Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed virtual disks. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 27, pages 420-430. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version lee:petal.

Keywords: parallel I/O, distributed file system, declustering, reliability, pario-bib

Comment: Part of jin:io-book; reformatted version of lee:petal.

lee:comparison:
K. K. Lee, M. Kallahalla, B. S. Lee, and P. J. Varman. Performance comparison of prefetching and placement policies for parallel I/O. International Journal of Parallel and Distributed Systems and Networks, 5(2):76-84, 2002.

Keywords: parallel I/O, file prefetching, pario-bib

lee:external:
Jang Sun Lee, Sunghoon Ko, Sanjay Ranka, and Byung Eui Min. High-performance external computations using user-controllable I/O. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, pages 303-307. IEEE Computer Society Press, March 1998.

Keywords: parallel I/O, pario-bib

lee:file-assignment:
Lin-Wen Lee, Peter Scheuermann, and Radek Vingralek. File assignment in parallel I/O systems with minimal variance of service time. IEEE Transactions on Computers, 49(2):127-140, February 2000.

Abstract: We address the problem of assigning nonpartitioned files in a parallel I/O system where the file accesses exhibit Poisson arrival rates and fixed service times. We present two new file assignment algorithms based on open queuing networks which aim at simultaneously minimizing the load imbalance across all disks as well as the variance of the service time at each disk. We first present an off-line algorithm, Sort Partition, which assigns to each disk files with similar access time. Next, we show that, assuming that a perfectly balanced file assignment can be found for a given set of files, Sort Partition will find the one with minimal mean response time. We then present an on-line algorithm, Hybrid Partition, that assigns groups of files with similar service times in successive intervals while guaranteeing that the load imbalance at any point does not exceed a certain threshold. We report on synthetic experiments which exhibit skew in file accesses and sizes, and we compare the performance of our new algorithms with the vanilla greedy file allocation algorithm.

Keywords: parallel I/O, parallel file system, pario-bib
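
The flavor of Sort Partition comes through in a few lines. A simplified sketch under our own assumptions (each file's load is its arrival rate times its fixed service time); the published algorithm is more careful about balancing:

    # Simplified sketch of the Sort Partition idea: sorting by service time
    # puts files with similar service times on the same disk (low variance),
    # while capping each disk's load near the average (balance). Not the
    # authors' exact algorithm; tuple fields are illustrative.
    def sort_partition(files, ndisks):
        """files: list of (name, service_time, arrival_rate) tuples."""
        files = sorted(files, key=lambda f: f[1])
        target = sum(s * r for _, s, r in files) / ndisks
        disks, current, load = [], [], 0.0
        for name, s, r in files:
            if current and load + s * r > target and len(disks) < ndisks - 1:
                disks.append(current)       # this disk is full enough
                current, load = [], 0.0
            current.append(name)
            load += s * r
        disks.append(current)
        return disks

    # Example: six files, two disks; each disk gets files of similar cost.
    print(sort_partition([("a", 1, 1), ("b", 1, 1), ("c", 2, 1),
                          ("d", 2, 1), ("e", 4, 1), ("f", 4, 1)], 2))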

lee:impl:
Edward K. Lee. Software and performance issues in the implementation of a RAID prototype. Technical Report UCB/CSD 90/573, EECS, Univ. California at Berkeley, May 1990.

Keywords: parallel I/O, disk striping, performance, pario-bib

Comment: Details of their prototype. Defines terms like stripe unit. Explores ways to lay out parity. Does performance simulations. Describes ops needed in the device driver. Good to read if you plan to implement a RAID. Results: for small reads and writes, or under high load, parity placement doesn't matter; under low load, different placements are best for different large read/write cases. Best all-around is left-symmetric. See also lee:parity.

lee:jparity:
Edward K. Lee and Randy H. Katz. The performance of parity placements in disk arrays. IEEE Transactions on Computers, 42(6):651-664, June 1993.
See also earlier version lee:parity.
See also later version lee:bparity.

Keywords: RAID, reliability, parallel I/O, disk striping, pario-bib

Comment: Journal version of lee:parity.

lee:logical-disks:
Jang Sun Lee, Jungmin Kim, P. Bruce Berra, and Sanjay Ranka. Logical disks: User-controllable I/O for scientific applications. In Proceedings of the 1996 IEEE Symposium on Parallel and Distributed Processing, pages 340-347. IEEE Computer Society Press, October 1996.

Abstract: In this paper we propose user-controllable I/O operations and explore their effects with some synthetic access patterns. The operations allow users to determine a file structure matching the access patterns, control the layout and distribution of data blocks on physical disks, and express various access patterns with a minimum number of I/O operations. The operations do not use a file pointer to access data as in typical file systems, which eliminates the overhead of managing the file offset, makes it easy to share data, and reduces the number of I/O operations.

Keywords: logical disks, parallel I/O, pario-bib

lee:pario:
K-K. Lee and P. Varman. Prefetching and I/O parallelism in multiple disk systems. In Proceedings of the 1995 International Conference on Parallel Processing, pages III:160-163, St. Charles, IL, August 1995. CRC Press.

Keywords: parallel I/O, prefetching, disk array, pario-bib

lee:parity:
Edward K. Lee and Randy H. Katz. Performance consequences of parity placement in disk arrays. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 190-199, 1991.
See also later version lee:jparity.

Keywords: RAID, disk array, reliability, parallel I/O, pario-bib

Comment: Interesting comparison of several parity placement schemes. Boils down to two basic choices, depending on whether read performance or write performance is more important to you.

lee:petal:
Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, MA, October 1996.

Keywords: parallel I/O, distributed file system, declustering, reliability, pario-bib

Comment: They are trying to build a file server that is easier to manage than most of today's distributed file systems, because disks are cheap but management is expensive. They describe a distributed file server that spreads blocks of all files across many disks and many servers. They use chained declustering so that they can survive loss of server or disk. They dynamically balance load. They dynamically reconfigure when new virtual disks are created or new physical disks are added. They've built it all and are now going to look at possible file systems that can take advantage of the features of Petal.

lee:raidmodel:
Edward K. Lee and Randy H. Katz. An analytic performance model of disk arrays. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 98-109, 1993.

Keywords: disk array, parallel I/O, RAID, analytic model, pario-bib

lee:redist:
Jang Sun Lee, Sanjay Ranka, and Ravi V. Shankar. Communication-efficient and memory-bounded external redistribution. Technical report, Syracuse University, 1995.

Abstract: This paper presents communication-efficient algorithms for the external data redistribution problem. Deterministic lower bounds and upper bounds are presented for the number of I/O operations, communication time and the memory requirements of external redistribution. Our algorithms differ from most other algorithms presented for out-of-core applications in that they are optimal (within a small constant factor) not only in the number of I/O operations, but also in the time taken for communication. A coarse-grained MIMD architecture with I/O subsystems attached to each processor is assumed, but the results are expected to be applicable to a wider variety of architectures.

Keywords: parallel I/O algorithm, out-of-core, pario-bib

Comment: See shankar:transport for the underlying communication primitives.

lee:support:
Jenq Kuen Lee, Ing-Kuen Tsaur, and San-Yih Huang. Language and environmental support for parallel object I/O on distributed memory environments. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 756-761. SIAM, February 1995.

Abstract: The paper describes a parallel file object environment to support distributed array storage on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distribution from the memory level to the file level. It allows parallel I/O according to the distribution of objects in an application. When objects are read and/or written by multiple applications using different distributions, we present a novel scheme that helps programmers select the data distribution pattern that minimizes remote data movement for the storage of array objects on distributed file systems.

Keywords: parallel I/O, object oriented, distributed memory, pario-bib

lee:userio:
Jang Sun Lee, Sang-Gue Oh, Bruce P. Berra, and Sanjay Ranka. User-controllable I/O for parallel computers. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), pages 442-453, August 1996.

Abstract: This paper presents the design of UPIO, software for user-controllable parallel input and output. UPIO is designed to maximize I/O performance for scientific applications on MIMD multicomputers. The most important features of UPIO are that it supports a domain-specific file model and a variety of application interfaces to express numerous access patterns. UPIO provides user-controllable I/O operations that allow users to control data access, file structure, and data distribution. The domain-specific file model and user controllability give low I/O overhead and allow programmers to exploit the aggregate bandwidth of parallel disks.

Keywords: parallel I/O, pario-bib

Comment: They describe an interface that seems to allow easier access for programmers who want to map matrices onto parallel files. The concepts are not well explained, so it is hard to understand what is new and different. They make no explicit comparison with other advanced interfaces like those in Vesta or Galley. No performance results.

leon:dfs:
Christopher S. Leon. An implementation of external-memory depth-first search. Technical Report PCS-TR98-333, Dept. of Computer Science, Dartmouth College, June 1998.

Abstract: In many different areas of computing, problems can arise which are too large to fit in main memory. For these problems, the I/O cost of moving data between main memory and secondary storage (for example, disks) becomes a significant bottleneck affecting the performance of the program.

Since most algorithms do not take into account the size of main memory, new algorithms have been developed to optimize the number of I/O's performed. This paper details the implementation of one such algorithm, for external-memory depth-first search.

Depth-first search is a basic tool for solving many problems in graph theory, and since graph theory is applicable for many large computational problems, it is important to make sure that such a basic tool is designed to avoid the bottleneck of main memory to secondary storage I/O's.

The algorithm whose implementation is described in this paper is sketched out in an extended abstract by Chiang et al. We attempt to improve the given algorithm by minimizing I/O's performed, and to extend the algorithm by finding disjoint trees, and by classifying all the edges in the problem.

Keywords: out-of-core algorithm, parallel I/O, pario-bib

Comment: Senior honors thesis. Advisor: Tom Cormen.

lepper:cfd:
J. Lepper, U. Schnell, and K.R.G. Hein. Parallelization of a simulation code for reactive flows on the Intel Paragon. Computers and Mathematics with Applications, 35(7):101-109, April 1998.

Abstract: The paper shows the implementation of a 3D simulation code for turbulent flow and combustion processes in full-scale utility boilers on an Intel Paragon XP/S computer. For the portable parallelization, an explicit approach is chosen using a domain decomposition method for the static subdivision of the numerical grid together with the SPMD programming model. The measured speedup for the presented case using a coarse grid is good, although some numerical requirements restrict the implemented message passing to strongly synchronized communication. On the Paragon, the NX message passing library is used for the computations. Furthermore, MPI and PVM are applied and their pros and cons on this computer are described. In addition to the basic message passing techniques for local and global communication, other possibilities are investigated. Besides the applicability of the vectorizing capability of the compiler, the influence of the I/O performance during computations is demonstrated. The scalability of the parallel application is presented for a refined discretization.

Keywords: parallel I/O, application, pario-bib

li:bfxm:
Qun Li, Jie Jing, and Li Xie. BFXM: A parallel file system model based on the mechanism of distributed shared memory. ACM Operating Systems Review, 31(4):30-40, October 1997.

Keywords: parallel file system, distributed shared memory, DSM, COMA, pario-bib

Comment: Basically, cooperative shared memory with a backing store.

li:jmodels:
Zhiyong Li, Peter H. Mills, and John H. Reif. Models and resource metrics for parallel and distributed computation. Parallel Algorithms and Applications, 8:35-59, 1996.
See also earlier version li:models.

Keywords: parallel I/O algorithm, pario-bib

li:models:
Zhiyong Li, Peter H. Mills, and John H. Reif. Models and resource metrics for parallel and distributed computation. In Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, pages 51-60, Hawaii, January 1995.
See also later version li:jmodels.

Abstract: This paper presents a framework of using resource metrics to characterize the various models of parallel computation. Our framework reflects the approach of recent models to abstract architectural details into several generic parameters, which we call resource metrics. We examine the different resource metrics chosen by different parallel models, categorizing the models into four classes: the basic synchronous models, and extensions of the basic models which more accurately reflect practical machines by incorporating notions of asynchrony, communication cost and memory hierarchy. We then present a new parallel computation model, the LogP-HMM model, as an illustration of design principles based on the framework of resource metrics. The LogP-HMM model extends an existing parameterized network model (LogP) with a sequential hierarchical memory model (HMM) characterizing each processor. The result accurately captures both network communication costs and the effects of multileveled memory such as local cache and I/O. We examine the potential utility of our model in the design of near optimal sorting and FFT algorithms.

Keywords: parallel I/O algorithm, pario-bib

li:recursive-tr:
Zhiyong Li, John H. Reif, and Sandeep K. S. Gupta. Synthesizing efficient out-of-core programs for block recursive algorithms using block-cyclic data distributions. Technical Report 96-04, Dept. of Computer Science, Duke University, March 1996.
See also later version li:recursive.

Abstract: In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used physical track distribution cyclic(B_d), where B_d is the physical disk block size.

We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem.

We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as large as eight times for a tensor product, and the dynamic programming approach can largely reduce the number of passes to access out-of-core data for the overall tensor product formulas.

Keywords: parallel I/O, out-of-core algorithm, pario-bib
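
The cyclic(B) distributions this work optimizes over are plain index arithmetic; a small helper, in our own notation, makes the mapping concrete:

    # Mapping a global element index to (disk, local offset) under a
    # cyclic(B) distribution across D disks: blocks of B consecutive
    # elements are dealt to the disks round-robin. Notation is ours.
    def cyclic_b(i, B, D):
        block = i // B
        disk = block % D                    # round-robin over blocks
        local = (block // D) * B + (i % B)  # offset within that disk
        return disk, local

    # With B=2 and D=3, elements 0..11 land as:
    #   disk 0: 0,1,6,7    disk 1: 2,3,8,9    disk 2: 4,5,10,11
    assert cyclic_b(7, 2, 3) == (0, 3)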

li:synthesizing:
Zhiyong Li, John H. Reif, and Sandeep K. S. Gupta. Synthesizing efficient out-of-core programs for block recursive algorithms using block-cyclic data distributions. In Proceedings of the 1996 International Conference on Parallel Processing, pages II:142-149, St. Charles, IL, August 1996. IEEE Computer Society Press.
See also earlier version li:synthesizing-tr.

Abstract: This paper presents a framework for synthesizing I/O-efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform and matrix transpositions. The programs are synthesized from tensor (Kronecker) product representations of algorithms. These programs are optimized for a striped two-level memory model wherein the out-of-core data can have block-cyclic distributions on multiple disks.

Keywords: parallel I/O algorithm, pario-bib

li:synthesizing-tr:
Zhiyong Li, John H. Reif, and Sandeep K. S. Gupta. Synthesizing efficient out-of-core programs for block recursive algorithms using block-cyclic data distributions. Technical Report TR-96-04, Dept. of Computer Science, Duke University, March 1996.
See also later version li:synthesizing.

Abstract: In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used physical track distribution cyclic(B_d), where B_d is the physical disk block size.

We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem.

We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as large as eight times for a tensor product, and the dynamic programming approach can largely reduce the number of passes to access out-of-core data for the overall tensor product formulas.

Keywords: parallel I/O algorithm, pario-bib

liao:overlapping:
Wei-keng Liao, Alok Choudhary, Kenin Coloma, Lee Ward, Eric Russell, and Neil Pundit. Scalable implementations of MPI atomicity for concurrent overlapping I/O. In P. Sadayappan and C.-S. Yang, editors, Proceedings of the 2003 International Conference on Parallel Processing, pages 239-246, Kaohsiung, Taiwan, October 2003. IEEE Computer Society Press.

Abstract: For concurrent I/O operations, atomicity defines the results in the overlapping file regions simultaneously read/written by requesting processes. Atomicity has been well studied at the file system level, as in the POSIX standard. In this paper, we investigate the problems arising from the implementation of MPI atomicity for concurrent overlapping write access and provide two programming solutions. Since the MPI definition of atomicity differs from the POSIX one, an implementation that simply relies on POSIX file systems does not guarantee correct MPI semantics. To have a correct implementation of atomic I/O in MPI, we examine the efficiency of three approaches: 1) file locking, 2) graph-coloring, and 3) process-rank ordering. The performance complexity of these methods is analyzed, and experimental results are presented for file systems including NFS, SGI's XFS, and IBM's GPFS.

Keywords: MPI, concurrent I/O operations, overlapping write access, atomic I/O operations, pario-bib
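
Of the three approaches, process-rank ordering is the simplest to convey in code. A hedged sketch using mpi4py (our illustration, not the authors' implementation): a token passed in rank order serializes the writes, so each overlapping region has a well-defined last writer. The real method need only order the processes whose regions actually conflict, not all of them.

    # Sketch of the process-rank-ordering idea for overlapping writes.
    # Illustrative only; the file name, offsets, and sizes are made up.
    from mpi4py import MPI
    import os

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    fd = os.open("shared.dat", os.O_CREAT | os.O_WRONLY, 0o644)
    offset, length = rank * 512, 1024       # deliberately overlapping regions

    if rank > 0:
        comm.recv(source=rank - 1)          # wait for the lower rank to finish
    os.pwrite(fd, bytes([rank % 256]) * length, offset)
    if rank < size - 1:
        comm.send(None, dest=rank + 1)      # let the next rank write

    os.close(fd)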

ligon:pfs:
W. B. Ligon and R. B. Ross. Implementation and performance of a parallel file system for high performance distributed applications. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 471-480. IEEE Computer Society Press, August 1996.

Abstract: Dedicated cluster parallel computers (DCPCs) are emerging as low-cost high performance environments for many important applications in science and engineering. A significant class of applications that perform well on a DCPC are coarse-grain applications that involve large amounts of file I/O. Current research in parallel file systems for distributed systems is providing a mechanism for adapting these applications to the DCPC environment. We present the Parallel Virtual File System (PVFS), a system that provides disk striping across multiple nodes in a distributed parallel computer and file partitioning among tasks in a parallel program. PVFS is unique among similar systems in that it uses a stream-based approach that represents each file access with a single set of request parameters and decouples the number of network messages from details of the file striping and partitioning. PVFS also provides support for efficient collective file accesses and allows overlapping file partitions. We present results of early performance experiments that show PVFS achieves excellent speedups in accessing moderately sized file segments.

Keywords: parallel I/O, cluster computing, parallel file system, pario-bib

lin:clusterio:
Zheng Lin and Songnian Zhou. Parallelizing I/O intensive applications for a workstation cluster: a case study. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 17-36, Newport Beach, CA, 1993. Also published in Computer Architecture News 21(5), December 1993, pages 15-22.

Keywords: parallel I/O, workstation cluster, text retrieval, pario-bib

Comment: They implement a parallel text retrieval application on a cluster of DEC 5000 workstations.

lin:optimizing:
Yih-Fang Lin, Chien-Min Wang, and Jan-Jan Wu. Optimizing I/O server placement for parallel I/O on switch-based irregular networks. Lecture Notes in Computer Science, 3358:997-1006, November 2004.

Abstract: In this paper, we study I/O server placement for optimizing parallel I/O performance on switch-based clusters, which typically adopt irregular network topologies to allow construction of scalable systems with incremental expansion capability. Finding the optimal solution to this problem is computationally intractable. We quantified the number of messages travelling through each network link by a workload function, and developed three heuristic algorithms to find good solutions based on the values of the workload function. Our simulation results demonstrate the performance advantage of our algorithms over a number of algorithms commonly used in existing parallel systems. In particular, the load-balance-based algorithm is superior to the other algorithms in most cases, with improvement ratios of 10% to 95% in terms of parallel I/O throughput.

Keywords: I/O server placement, network topologies, switch-based clusters, pario-bib

litwin:LSA:
Witold Litwin and Jai Menon. Scalable distributed log structured arrays. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 8, pages 107-116. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: disk array, log-structured file system, RAID, parallel I/O, pario-bib

Comment: Part of jin:io-book.

liu:pario-interface:
X. Liu. The performance research of the distributed parallel server system with distributed parallel I/O interface. Acta Electronica Sinica, 30(12):1808-1810, 2002.

Keywords: parallel I/O, pario-bib

livny:stripe:
M. Livny, S. Khoshafian, and H. Boral. Multi-disk management algorithms. In Proceedings of the 1987 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 69-77, May 1987.

Keywords: parallel I/O, disk striping, disk array, pario-bib

lo:disks:
Raymond Lo and Norman Matloff. A probabilistic limit on the virtual size of replicated file systems. Technical report, Department of EE and CS, UC Davis, 1989.

Keywords: parallel I/O, replication, file system, disk mirroring, disk shadowing, pario-bib

Comment: A look at shadowed disks. If you have $k$ disks set up to read from the disk with the shortest seek, but write to all disks, you get increased reliability, read time like the min of the seeks, and write time like the max of the seeks. It appears that with increasing $k$ you can get good performance. But this paper clearly shows that, because writes move all disk heads to the same location, the effective value of $k$ is actually quite low. Only 4-10 disks are likely to be useful for most traffic loads.
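
A back-of-the-envelope simulation makes the head-movement effect easy to see. The following minimal sketch is mine, not the paper's model; it assumes uniformly distributed request positions and a seek cost proportional to head travel, and all names are illustrative:

    import random

    def mean_seek(k, write_frac, n=200_000):
        # k shadowed disks; reads use the nearest head, writes wait for the
        # farthest head and then leave every head at the same cylinder
        heads = [random.random() for _ in range(k)]
        total = 0.0
        for _ in range(n):
            target = random.random()
            dists = [abs(target - h) for h in heads]
            if random.random() < write_frac:
                total += max(dists)      # write: slowest head dominates
                heads = [target] * k     # ...and co-locates all heads
            else:
                i = min(range(k), key=dists.__getitem__)
                total += dists[i]        # read: nearest head wins
                heads[i] = target        # only that head moves
        return total / n

    for k in (1, 2, 4, 8, 16):
        print(k, round(mean_seek(k, write_frac=0.3), 4))

With even a modest write fraction, the heads keep collapsing onto one cylinder, so the marginal benefit of each extra disk falls off quickly, consistent with the paper's 4-10 disk conclusion.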

lockey:characterization:
P. Lockey, R. Proctor, and I. D. James. Characterization of I/O requirements in a massively parallel shelf sea model. The International Journal of High Performance Computing Applications, 12(3):320-332, Fall 1998.

Abstract: It is now recognized that a high level of I/O performance is crucial in making effective use of parallel machines for many scientific application codes. This paper considers the I/O requirements in one particular scientific application area: 3D modelling of continental shelf sea regions. We identify some of the scientific aims which drive the model development, and the consequent impact on the I/O needs. As a case study we take a parallel production code running a simulation of the North Sea on a Cray T3D platform and investigate the I/O performance in dealing with the dominant I/O component: dumping of results data to disk. In order to place the performance issues in a more general framework we construct a simple theoretical model of I/O requirements, and use this to probe the impact of available I/O performance on current and proposed scientific objectives.

Keywords: parallel I/O application, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

long:swift-raid:
Darrell D. E. Long and Bruce R. Montague. Swift/RAID: A distributed RAID system. Computing Systems, 7(3):333-359, Summer 1994.

Keywords: RAID, disk array, parallel I/O, distributed file system, pario-bib

Comment: One of the features of this system is the way they develop and execute transaction plans as little scripts that are built by the client, sent to the servers, and then executed by interpreters.

loverso:sfs:
Susan J. LoVerso, Marshall Isman, Andy Nanopoulos, William Nesheim, Ewan D. Milne, and Richard Wheeler. sfs: A parallel file system for the CM-5. In Proceedings of the 1993 Summer USENIX Technical Conference, pages 291-305, 1993.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: They took the Unix file system from SunOS and extended it to run on the CM-5. This involved handling non-power-of-two block sizes, parallel I/O calls, large file sizes, and more encouragement for extents to be allocated. The hardware is particularly suited to RAID 3 with a 16 byte striping unit, although in theory the software could do anything it wants. Geared to data-parallel model. Proc nodes (PNs) contact the timesharing daemon (TSD) on the control processor (CP), who gets block lists from the file system, which runs on one of the CPs. The TSD then arranges with the disk storage nodes (DSNs) to do the transfer directly with the PNs. Each DSN has 8 MB of buffer space, 8 disk drives, 4 SCSI busses, and a SPARC as controller. Partition managers mount non-local sfs via NFS. Performance results good. Up to 185 MB/s on 118 (2 MB/s) disks.

lumb:facade:
Christopher R. Lumb. Façade: Virtual storage devices with performance guarantees. In Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies, San Francisco, CA, April 2003. USENIX Association.

Abstract: High-end storage systems, such as those in large data centers, must service multiple independent workloads. Workloads often require predictable quality of service, despite the fact that they have to compete with other rapidly-changing workloads for access to common storage resources. We present a novel approach to providing performance guarantees in this highly-volatile scenario, in an efficient and cost-effective way. Façade, a virtual store controller, sits between hosts and storage devices in the network, and throttles individual I/O requests from multiple clients so that devices do not saturate. We implemented a prototype, and evaluated it using real workloads on an enterprise storage system. We also instantiated it to the particular case of emulating commercial disk arrays. Our results show that Façade satisfies performance objectives while making efficient use of the storage resources, even in the presence of failures and bursty workloads with stringent performance requirements.

Keywords: file systems, qos, quality of service, pario-bib

lyster:geos-das:
P.M. Lyster, K. Ekers, J. Guo, M. Harber, D. Lamich, J.W. Larson, R. Lucchesi, R. Rood, S. Schubert, W. Sawyer, M. Sienkiewicz, A. da Silva, J. Stobie, L.L. Takacs, R. Todling, and J. Zero. Parallel computing at the NASA data assimilation office (DAO). In Proceedings of SC97: High Performance Networking and Computing, San Jose, CA, November 1997. IEEE Computer Society Press.

Keywords: parallel I/O, pario-bib

Comment: This paper is about a NASA project GEOS-DAS (Goddard Earth Observing System-Data Assimilation System). The goal of the project is to produce ``accurate gridded datasets of atmospheric fields''. The data will be used by meteorologists for weather analysis and forecasts as well as being a tool for climate research. This paper discusses their plans to parallelize the core code of the system. They include a section on parallel I/O.

ma:buffering:
Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.

Abstract: Efficient collective output of intermediate results to secondary storage becomes more and more important for scientific simulations as the gap between processing power/interconnection bandwidth and the I/O system bandwidth enlarges. Dedicated servers can offload I/O from compute processors and shorten the execution time, but it is not always possible or easy for an application to use them. We propose the use of active buffering with threads (ABT) for overlapping I/O with computation efficiently and flexibly without dedicated I/O servers. We show that the implementation of ABT in ROMIO, a popular implementation of MPI-IO, greatly reduces the application-visible cost of ROMIO's collective write calls, and improves an application's overall performance by hiding I/O cost and saving implicit synchronization overhead from collective write operations. Further, ABT is high-level, platform-independent, and transparent to users, giving users the benefit of overlapping I/O with other processing tasks even when the file system or parallel I/O library does not support asynchronous I/O.

Keywords: parallel I/O, pario-bib

ma:flexible:
Xiaosong Ma, Xiangmin Jiao, Michael Campbell, and Marianne Winslett. Flexible and efficient parallel I/O for large-scale multi-component simulations. In Proceedings of the Fourth Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications. IEEE Computer Society Press, April 2003.

Abstract: In this paper, we discuss our experience of providing high performance parallel I/O for a large-scale, on-going, multi-disciplinary simulation project for solid propellant rockets. We describe the performance and data management issues observed in this project and present our solutions, including (1) support for relatively fine-grained distribution of irregular datasets in parallel I/O, (2) a flexible data management facility for inter-module communication, and (3) two schemes to overlap computation with I/O. Performance results obtained from the rocket simulation's development and production platforms show that our I/O optimizations can dramatically reduce the simulation's visible I/O cost, as well as the number of disk files, and significantly improve the overall performance. Meanwhile, our data management facility helps to provide simulation developers with simple user interfaces for parallel I/O.

Keywords: parallel I/O, pario-bib

mache:spatial:
Jens Mache, Virginia Lo, Marilynn Livingston, and Sharad Garg. The impact of spatial layout of jobs on parallel I/O performance. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 45-56, Atlanta, GA, May 1999. ACM Press.

Abstract: Input/Output is a big obstacle to effective use of teraflops-scale computing systems. Motivated by earlier parallel I/O measurements on an Intel TFLOPS machine, we conduct studies to determine the sensitivity of parallel I/O performance on multi-programmed mesh-connected machines with respect to number of I/O nodes, number of compute nodes, network link bandwidth, I/O node bandwidth, spatial layout of jobs, and read or write demands of applications.

Our extensive simulations and analytical modeling yield important insights into the limitations on parallel I/O performance due to network contention, and into the possible gains in parallel I/O performance that can be achieved by tuning the spatial layout of jobs.

Applying these results, we devise a new processor allocation strategy that is sensitive to parallel I/O traffic and the resulting network contention. In performance evaluations driven by synthetic workloads and by a real workload trace captured at the San Diego Supercomputing Center, the new strategy improves the average response time of parallel I/O intensive jobs by up to a factor of 4.5.

Keywords: parallel I/O, pario-bib

maciel:dgw:
Frederico B. Maciel, Nobutoshi Sagawa, and Teruo Tanaka. Dynamic Gateways: A novel approach to improve networking performance and availability on parallel servers. In Proceedings of the High-Performance Computing and Networking Symposium (HPCN'98), pages 678-687, 1998.

Abstract: Parallel servers realize scalability and availability by effectively using multiple hardware resources (i.e., nodes and disks). Scalability is improved by distributing processes and data onto multiple resources; and availability is maintained by substituting a failed resource with a spare one. Dynamic Gateways extends these features to networking, by balancing the traffic among multiple connections to the network in order to improve scalability, and detours traffic around failed resources to maintain availability. This is made transparent to the clients and to applications in the server by using proxy and gratuitous ARP to control the network traffic. A performance evaluation shows that Dynamic Gateways improves the scalability (allowing the maximum networking performance to increase with increasing number of connections) and the performance (improving throughput and reducing access latency).

Keywords: parallel networking, network I/O, parallel I/O, pario-bib

Comment: Contact fred-m@crl.hitachi.co.jp, sagawa@crl.hitachi.co.jp, or tetanaka@kanagawa.hitachi.co.jp.

mackay:groundwater:
David Mackay, G. Mahinthakumar, and Ed D'Azevedo. A study of I/O in a parallel finite element groundwater transport code. The International Journal of High Performance Computing Applications, 12(3):307-319, Fall 1998.

Abstract: A parallel finite element groundwater transport code is used to compare three different strategies for performing parallel I/O: (1) have a single processor collect data and perform sequential I/O in large blocks, (2) use variations of vendor-specific I/O extensions, and (3) use the EDONIO I/O library. Each processor performs many writes of one to four kilobytes to reorganize local data in a global shared file. Our findings suggest having a single processor collect data and perform large block-contiguous operations may be quite efficient and portable for up to 32-processor configurations. This approach does not scale well for a larger number of processors since the single processor becomes a bottleneck for gathering data. The effective application I/O rate observed, which includes times for opening and closing files, is only a fraction of the peak device read/write rates. Some form of data redistribution and buffering in remote memory as performed in EDONIO may yield significant improvements for non-contiguous data I/O access patterns and short requests. Implementors of parallel I/O systems may consider some form of buffering as performed in EDONIO to speed up such I/O requirements.

Keywords: parallel I/O application, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

madhyastha:adaptive:
Tara M. Madhyastha and Daniel A. Reed. Intelligent, adaptive file system policy selection. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 172-179. IEEE Computer Society Press, October 1996.
See also later version madhyastha:thesis.

Abstract: Traditionally, maximizing input/output performance has required tailoring application input/output patterns to the idiosyncrasies of specific input/output systems. The authors show that one can achieve high application input/output performance via a low overhead input/output system that automatically recognizes file access patterns and adaptively modifies system policies to match application requirements. This approach reduces the application developer's input/output optimization effort by isolating input/output optimization decisions within a retargetable file system infrastructure. To validate these claims, they have built a lightweight file system policy testbed that uses a trained learning mechanism to recognize access patterns. The file system then uses these access pattern classifications to select appropriate caching strategies, dynamically adapting file system policies to changing input/output demands throughout application execution. The experimental data show dramatic speedups on both benchmarks and input/output intensive scientific applications.

Keywords: parallel I/O, pario-bib

Comment: See also madhyastha:thesis, and related papers.

madhyastha:classification:
Tara M. Madhyastha and Daniel A. Reed. Input/output access pattern classification using hidden Markov models. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 57-67, San Jose, CA, November 1997. ACM Press.
See also later version madhyastha:thesis.

Abstract: Input/output performance on current parallel file systems is sensitive to a good match of application access pattern to file system capabilities. Automatic input/output access classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper we examine a new method for access pattern classification that uses hidden Markov models, trained on access patterns from previous executions, to create a probabilistic model of input/output accesses. We compare this approach to a neural network classification framework, presenting performance results from parallel and sequential benchmarks and applications.

Keywords: workload characterization, file access pattern, parallel I/O, pario-bib

Comment: The most interesting thing in this paper is the use of a Hidden Markov Model to understand the access pattern of an application to a file. After running the application on the file once, and simultaneously training their HMM, they use the result to tune the system for the next execution (cache size, cache partitioning, prefetching, Intel file mode, etc). They get much better performance in future runs. See also madhyastha:thesis, and related papers.
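
As a drastically simplified illustration of the train-then-classify idea (a plain first-order Markov model rather than the paper's HMM; the category names and numbers here are invented):

    import math
    from collections import Counter, defaultdict

    def train(seq):
        # transition probabilities between access categories observed
        # during a previous execution
        counts = defaultdict(Counter)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
        return {a: {b: n / sum(c.values()) for b, n in c.items()}
                for a, c in counts.items()}

    def score(model, seq):
        # log-likelihood of a new run's accesses under a trained model
        return sum(math.log(model.get(a, {}).get(b, 1e-9))
                   for a, b in zip(seq, seq[1:]))

    models = {"sequential": train(["seq"] * 40),
              "strided": train(["seq", "skip"] * 20)}
    run = ["seq", "skip"] * 5
    print(max(models, key=lambda m: score(models[m], run)))  # -> strided

The best-scoring classification would then drive the policy choices the comment mentions (cache size, cache partitioning, prefetching, file mode).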

madhyastha:global:
Tara M. Madhyastha and Daniel A. Reed. Exploiting global input/output access pattern classification. In Proceedings of SC97: High Performance Networking and Computing, San Jose, November 1997. ACM Press.
See also later version madhyastha:thesis.

Abstract: Parallel input/output systems attempt to alleviate the performance bottleneck that affects many input/output intensive applications. In such systems, an understanding of the application access pattern, especially how requests from multiple processors for different file regions are logically related, is important for optimizing file system performance. We propose a method for automatically classifying these global access patterns and using these global classifications to select and tune file system policies to improve input/output performance. We demonstrate this approach on benchmarks and scientific applications using global classification to automatically select appropriate underlying Intel PFS input/output modes and server buffering strategies.

Keywords: file access pattern, parallel I/O, pario-bib

Comment: No page numbers: web and CDROM proceedings only. See also madhyastha:thesis and related papers.

madhyastha:informed:
Tara M. Madhyastha, Garth A. Gibson, and Christos Faloutsos. Informed prefetching of collective input/output requests. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: informed prefetching, disk-directed I/O, parallel I/O, pario-bib

Comment: They argue that if enough application prefetches are made, a standard Unix interface will provide the same performance as a collective I/O interface. They use simulation to show that if the file ordering is preserved, then the prefetch depth (the number of advance requests) is bounded by the number of disk drives. They look at two global access patterns: a simple interleaved sequential pattern and a 3-D block decomposition. Their experiment used 8 procs and 8 disks and compared the prefetching techniques to disk-directed I/O. Empirical studies showed that they needed a prefetch horizon of one to two times the number of disks to match the performance of disk-directed I/O, but the prefetching techniques require more memory.
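
A tiny sketch of the prefetch-horizon rule as I read the comment above (one to two outstanding requests per disk, issued in file order); the function and its arguments are hypothetical, not the paper's API:

    def issue_prefetches(outstanding, upcoming, ndisks, factor=2):
        # keep about `factor` requests in flight per disk, preserving
        # the file ordering of the block list
        horizon = factor * ndisks
        while len(outstanding) < horizon and upcoming:
            outstanding.append(upcoming.pop(0))  # hand to the I/O system
        return outstanding

    print(len(issue_prefetches([], list(range(100)), ndisks=8)))  # -> 16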

madhyastha:learning:
Tara M. Madhyastha and Daniel A. Reed. Learning to classify parallel input/output access patterns. IEEE Transactions on Parallel and Distributed Systems, 13(8):802-813, August 2002.

Abstract: Input/output performance on current parallel file systems is sensitive to a good match of application access patterns to file system capabilities. Automatic input/output access pattern classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper, we examine and compare two novel input/output access pattern classification methods based on learning algorithms. The first approach uses a feedforward neural network previously trained on access pattern benchmarks to generate qualitative classifications. The second approach uses hidden Markov models trained on access patterns from previous executions to create a probabilistic model of input/output accesses. In a parallel application, access patterns can be recognized at the level of each local thread or as the global interleaving of all application threads. Classification of patterns at both levels is important for parallel file system performance; we propose a method for forming global classifications from local classifications. We present results from parallel and sequential benchmarks and applications that demonstrate the viability of this approach.

Keywords: parallel I/O, file access pattern, pario-bib

madhyastha:optimizing:
Tara M. Madhyastha, Christopher L. Elford, and Daniel A. Reed. Optimizing input/output using adaptive file system policies. In Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies, pages II:493-514, September 1996.
See also later version madhyastha:thesis.

Keywords: multiprocessor file system, prefetching, caching, parallel I/O, multiprocessor file system interface, pario-bib

Comment: See also madhyastha:thesis, and related papers.

madhyastha:thesis:
Tara Madhyastha. Automatic Classification of Input/Output Access Patterns. PhD thesis, University of Illinois, Urbana-Champaign, August 1997.

Keywords: parallel I/O, file access pattern, pario-bib

Comment: See also madhyastha:classification, madhyastha:global, madhyastha:adaptive, madhyastha:optimizing.

magoutis:direct:
Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer. Making the most out of direct-access network attached storage. In Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies, San Francisco, CA, April 2003. USENIX Association.

Abstract: The performance of high-speed network-attached storage applications is often limited by end-system overhead, caused primarily by memory copying and network protocol processing. In this paper, we examine alternative strategies for reducing overhead in such systems. We consider optimizations to remote procedure call (RPC)-based data transfer using either remote direct memory access (RDMA) or network interface support for pre-posting of application receive buffers. We demonstrate that both mechanisms enable file access throughput that saturates a 2Gb/s network link when performing large I/Os on relatively slow, commodity PCs. However, for multi-client workloads dominated by small I/Os, throughput is limited by the per-I/O overhead of processing RPCs in the server. For such workloads, we propose the use of a new network I/O mechanism, Optimistic RDMA (ORDMA). ORDMA is an alternative to RPC that aims to improve server throughput and response time for small I/Os. We measured performance improvements of up to 32% in server throughput and 36% in response time with use of ORDMA in our prototype.

Keywords: file systems, rpc optimizations, rdma, multi-client workload, small I/O, pario-bib

majumdar:characterize:
S. Majumdar and Yiu Ming Leung. Characterization of applications with I/O for processor scheduling in multiprogrammed parallel systems. In Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing, pages 298-307. IEEE Computer Society Press, 1994.

Abstract: Most studies of processor scheduling in multiprogrammed parallel systems have ignored the I/O performed by applications. Recent studies have demonstrated that significant I/O operations are performed by a number of different classes of parallel applications. This paper focuses on some basic issues that underlie scheduling in multiprogrammed parallel environments running applications with I/O. Characterization of the I/O behavior of parallel applications is discussed first. Based on simulation models this research investigates the influence of these I/O characteristics on processor scheduling.

Keywords: workload characterization, scheduling, parallel I/O, pario-bib

majumdar:management:
Shikharesh Majumdar and Faisal Shad. Characterization and management of I/O on multiprogrammed parallel systems. In Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing, pages 502-510, San Antonio, TX, October 1995. IEEE Computer Society Press.

Keywords: workload characterization, parallel I/O, pario-bib

Comment: Analytical workload model. Simulation studies. See also kwong:distribution.

malluhi:pss:
Qutaibah Malluhi and William E. Johnston. Approaches for a reliable high-performance distributed-parallel storage system. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 500-509. IEEE Computer Society Press, August 1996.

Abstract: The paper studies different schemes to enhance the reliability, availability and security of a high performance distributed storage system. We have previously designed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high speed wide area network to achieve scalability and high data throughput. The general approach of the paper employs erasure error correcting codes to add data redundancy that can be used to retrieve missing information caused by hardware, software, or human faults. The paper suggests techniques for reducing the communication and computation overhead incurred while retrieving missing data blocks from redundant information. These techniques include clustering, multidimensional coding, and the full two-dimensional parity scheme.

Keywords: parallel I/O, pario-bib

manuel:logjam:
Tom Manuel. Breaking the data-rate logjam with arrays of small disk drives. Electronics, 62(2):97-100, February 1989.

Keywords: parallel I/O, disk array, I/O bottleneck, pario-bib

Comment: See also Electronics, Nov. 88 p 24, Dec. 88 p 112. Trade journal short on disk arrays. Very good intro. No new technical content. Concentrates on RAID project. Lists several commercial versions. Mostly concentrates on single-controller versions.

marco:raid1:
R. Marco, J. Marco, D. Rodriguez, D. Cano, and I. Cabrillo. RAID-1 and data stripping across the GRID. Lecture Notes in Computer Science, 2970:119-123, March 2004.

Abstract: Striping techniques combined with an adequate replication policy across the Grid offer the possibility to improve significantly data access and processing times, while eliminating the need for local data mirroring, thus saving significantly on storage costs. First results on a local cluster following a simple strategy are presented.

Keywords: RAID, RAID-1, data striping, GRID, pario-bib

maspar:pario:
Parallel file I/O routines. MasPar Computer Corporation, 1992.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: Man pages for the MasPar file system interface. They have either a single shared file pointer, after which all processors read or write in an interleaved pattern, or individual (plural) file pointers, allowing arbitrary access patterns. Updated in 1992 with many more features.
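
A hedged sketch of the two pointer modes as described above (the function names and record-sized accesses are my invention, not MasPar's API):

    def singular_offsets(shared_ptr, nprocs, recsize):
        # one shared pointer: processor i reads record i of this collective
        # access; the pointer then advances past all nprocs records
        offs = [shared_ptr + i * recsize for i in range(nprocs)]
        return offs, shared_ptr + nprocs * recsize

    def plural_read(private_ptrs, rank, recsize):
        # plural pointers: each processor seeks and advances independently,
        # so arbitrary per-processor access patterns are possible
        off = private_ptrs[rank]
        private_ptrs[rank] = off + recsize
        return off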

masters:pario:
Del Masters. Improve disk subsystem performance with multiple serial drives in parallel. Computer Technology Review, 7(9):76-77, July 1987.

Keywords: parallel I/O, pario-bib

Comment: Information about the early Maximum Strategy disk array, which striped over 4 disk drives, apparently synchronously.

matloff:multidisk:
Norman S. Matloff. A multiple-disk system for both fault tolerance and improved performance. IEEE Transactions on Reliability, R-36(2):199-201, June 1987.

Keywords: parallel I/O, reliability, disk shadowing, disk mirroring, pario-bib

Comment: Variation on mirrored disks using more than 2 disks, to spread the files around. Good performance increases.

matthews:hippi:
Kevin C. Matthews. Experiences implementing a shared file system on a HIPPI disk array. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 77-88. IEEE Computer Society Press, September 1995.

Abstract: Shared file systems which use a physically shared mass storage device have existed for many years, although not on UNIX based operating systems. This paper describes a shared file system (SFS) that was implemented first as a special project on the Cray Research Inc. (CRI) UNICOS operating system. A more general product was then built on top of this project using a HIPPI disk array for the shared mass storage. The design of SFS is outlined, as well as some performance experiences with the product. We describe how SFS interacts with the OSF distributed file service (DFS) and with the CRI data migration facility (DMF). We also describe possible development directions for the SFS product.

Keywords: mass storage, distributed file system, parallel I/O, pario-bib

Comment: They use hardware to tie the same storage device (a disk array) to several computers (Cray C90s). They build a custom piece of hardware just to service semaphore requests very fast. HIPPI is the interconnect. Details have a lot to do with the synchronization between processors trying to update the same metadata; that's why they use the semaphores.

matthijs:framework:
F. Matthijs, Y. Berbers, and P. Verbaeten. A flexible I/O framework for parallel and distributed systems. In Proceedings of the Fifth International Workshop on Object Orientation in Operating Systems, pages 187-190. IEEE Computer Society Press, 1995.

Abstract: We propose a framework for I/O in parallel and distributed systems. The framework is highly customizable and extendible, and enables programmers to offer high level objects in their applications, without requiring them to struggle with the low level and sometimes complex details of high performance distributed I/O. Also, the framework exploits application specific information to improve I/O performance by allowing specialized programmers to customize the framework. Internally, we use indirection and granularity control to support migration, dynamic load balancing, fault tolerance, etc. for objects of the I/O system, including those representing application data.

Keywords: input-output programs, object-oriented, parallel systems; I/O performance, migration, dynamic load balancing, fault tolerance, parallel I/O, pario-bib

mayr:query:
Tobias Mayr, Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Leveraging non-uniform resources for parallel query processing. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 120-129, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: Modular clusters are now composed of non- uniform nodes with different CPUs, disks or network cards so that customers can adapt the cluster configuration to the changing technologies and to their changing needs. This challenges dataflow parallelism as the primary load balancing technique of existing parallel database systems. We show in this paper that dataflow parallelism alone is ill suited for modular clusters because running the same operation on different subsets of the data can not fully utilize non-uniform hardware resources. We propose and evaluate new load balancing techniques that blend pipeline parallelism with data parallelism. We consider relational operators as pipelines of fine-grained operations that can be located on different cluster nodes and executed in parallel on different data subsets to best exploit non-uniform resources. We present an experimental study that confirms the feasibility and effectiveness of the new techniques in a parallel execution engine prototype based on the open-source DBMS Predator.

Keywords: parallel query processing, load balancing, parallel I/O, pario-bib

mcmurdy:unstripe:
Ronald K. McMurdy and Badrinath Roysam. Improving RAID-5 performance by un-striping moderate-sized files. In Proceedings of the 1993 International Conference on Parallel Processing, pages II-279-282, St. Charles, IL, 1993. CRC Press.

Keywords: parallel I/O, disk array, pario-bib, RAID

Comment: Allocate small- and medium-sized files entirely on one disk rather than striped, to cut seek and rotation latency that would happen if they were spread across many disks.
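
A minimal sketch of that placement policy (the threshold, the round-robin chooser, and all names are mine, not the paper's):

    STRIPE_THRESHOLD = 64 * 1024       # hypothetical cutoff in bytes

    def place(file_size, ndisks, state):
        if file_size <= STRIPE_THRESHOLD:
            state["next"] = (state["next"] + 1) % ndisks
            return [state["next"]]     # whole file on one disk: one arm moves
        return list(range(ndisks))     # large file: stripe across all disks

    state = {"next": 0}
    print(place(8 * 1024, 8, state))          # small file -> one disk
    print(place(4 * 1024 * 1024, 8, state))   # large file -> all eight disks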

meador:array:
Wes E. Meador. Disk array systems. In Proceedings of IEEE Compcon, pages 143-146, Spring 1989.

Keywords: parallel I/O, disk array, disk striping, pario-bib

Comment: Describes Strategy 2 Disk Array Controller, which allows 4 or 8 drives, hardware striped, with parity drive and 0-4 hot spares. Up to 4 channels to cpu(s). Logical block interface. Defects, errors, formatting, drive failures all handled automatically. Peak 40 MB/s data transfer on each channel.

meiko:cs2:
Computing Surface CS-2: Technical overview. Meiko brochure S1002-10M115.01A, 1993.

Keywords: multiprocessor architecture, parallel I/O, pario-bib

Comment: Three node types: 4 SPARC (50 MHz), 1 SPARC + two Fujitsu vector procs, or 1 SPARC + 3 I/O ports. All have a special communications processor that supports remote memory access. Each has 128 MBytes in 16 banks. Memory-memory transfer operations using ``remote DMA'', supported by the communications processor. User-level comm interface, with protection. Uses multistage network with 8x8 crossbar switches, looks like a fat tree. S/BUS, separate from the memory bus, is used for I/O, either directly, or through 2 SCSI and 1 ethernet. Control and diagnostic networks. Parallel file system stripes across multiple partitions. Can use RAID. Communications processor has its own MMU; control registers are mapped to user space. Network-wide virtual addresses can support shared memory? Remote store, atomic operations, global operations. Comm proc can support I/O threads - but can it talk to the disks? OS based on Solaris 2, plus global shared memory, parallel file system, and capability-based protection. Machine is logically partitioned into login, devices, and parallel computation.

memik:patterns:
Gokhan Memik, Mahmut Kandemir, and Alok Choudhary. Exploiting inter-file access patterns using multi-collective I/O. In Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies, pages 245-258, Monterey, CA, January 2002. USENIX Association.

Abstract: This paper introduces a new concept called Multi-Collective I/O (MCIO) that extends conventional collective I/O to optimize I/O accesses to multiple arrays simultaneously. In this approach, as in collective I/O, multiple processors co-ordinate to perform I/O on behalf of each other if doing so improves overall I/O time. However, unlike collective I/O, MCIO considers multiple arrays simultaneously; that is, it has a more global view of the overall I/O behavior exhibited by the application. This paper shows that determining the optimal MCIO access pattern is an NP-complete problem, and proposes two different heuristics for the access pattern detection problem (also called the assignment problem). Both of the heuristics have been implemented within a runtime library, and tested using a large-scale scientific application. Our preliminary results show that MCIO outperforms collective I/O by as much as 87%. Our runtime library-based implementation can be used by users as well as optimizing compilers. Based on our results, we recommend that future library designers for I/O-intensive applications include MCIO in their suite of optimizations.

Keywords: file systems, pario-bib

menasce:mass:
Daniel Menascé, Odysseas Ioannis Pentakalos, and Yelena Yesha. An analytic model of hierarchical mass storage systems with network-attached storage devices. In Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 180-189, Philadelphia, PA, May 1996. ACM Press.

Keywords: network attached peripherals, analytic model, mass storage, parallel I/O, pario-bib

menon:bcompare:
Jai Menon. A performance comparison of RAID-5 and log-structured arrays. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 4, pages 55-64. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version menon:compare.

Keywords: RAID, disk array, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of menon:compare.

menon:bsparing:
Jai Menon. Comparison of sparing alternatives for disk arrays. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 9, pages 117-128. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version menon:sparing.

Keywords: parallel I/O, RAID, disk array, pario-bib

Comment: Part of jin:io-book; reformatted version of menon:sparing.

menon:compare:
Jai Menon. A performance comparison of RAID-5 and log-structured arrays. In Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing, pages 167-178, August 1995.
See also later version menon:bcompare.

Keywords: RAID, disk array, parallel I/O, pario-bib

Comment: He compares a RAID-5 disk array with a log-structured array (LSA). An LSA is essentially an implementation of a log-structured file system inside a disk controller. The disk controller buffers up writes in a non-volatile cache; when the outgoing data buffer is full, it is written to some large contiguous region of the disk. The controller manages a directory to keep track of the various segment locations, and does garbage collection (cleaning). They can insert a compression algorithm in front of the cache so that they get better cache and disk utilization by storing data in compressed form. For a fair comparison they compare with a similar feature in the plain RAID-5 array.

menon:daisy:
Jai Menon and Kent Treiber. Daisy: Virtual-disk hierarchical storage manager. ACM SIGMETRICS Performance Evaluation Review, 25(3):37-44, December 1997.

Keywords: hierarchical storage, tape storage, tertiary storage, tape robot, parallel I/O, pario-bib

Comment: Part of a special issue on parallel and distributed I/O.

menon:sparing:
Jai Menon and Dick Mattson. A comparison of sparing alternatives for disk arrays. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 318-329. ACM Press, 1992.
See also later version menon:bsparing.

Abstract: This paper explores how choice of sparing methods impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads with random single block reads and writes, array performance is compared in four different modes - normal mode (no disks have failed), degraded mode (a disk has failed and its data has not been reconstructed), rebuild mode (a disk has failed and its data is being reconstructed), and copyback mode (which is needed for distributed sparing and parity sparing when failed disks are replaced with new disks). Attention is concentrated on small disk subsystems (fewer than 32 disks) where choice of sparing method has significant impact on array performance, rather than large disk subsystems (64 or more disks). It is concluded that, for disk subsystems with a small number of disks, distributed sparing offers major advantages over dedicated sparing in normal, degraded and rebuild modes of operation, even if one has to pay a copyback penalty. Furthermore, it is better than parity sparing in rebuild mode and similar to it in other operating modes, making it the sparing method of choice.

Keywords: parallel I/O, RAID, disk array, pario-bib

menor:grid-io:
José M. Pérez Menor, Félix García, Jesús Carretero, Alejandro Calderón, Javier Fernández, and José Daniel García. A parallel I/O middleware to integrate heterogeneous storage resources on grids. Lecture Notes in Computer Science, 2970:124-131, March 2004.

Abstract: The philosophy behind the grid is to use idle resources to achieve a higher level of computational services (computation, storage, etc). Existing data grid solutions are based on new servers, specific APIs, and protocols; however, this approach is not a realistic solution for enterprises and universities, because it supposes the deployment of new data servers across the company. This paper describes a new approach to data access in computational grids. This approach is called GridExpand, a parallel I/O middleware that integrates heterogeneous data storage resources in grids. The proposed grid solution integrates available data network solutions (NFS, CIFS, WebDAV) and makes possible the access to a global grid file system. Our solution differs from others because it does not need the installation of new data servers with new protocols. Most of the data grid solutions use replication as the way to obtain high performance. Replication, however, introduces consistency problems for many collaborative applications, and sometimes requires the usage of lots of resources. To obtain high performance, we apply the parallel I/O techniques used in parallel file systems.

Keywords: data grids, parallel I/O, data declustering, pario-bib

merchant:striping:
Arif Merchant and Philip S. Yu. Analytic modeling and comparisons of striping strategies of replicated disk arrays. IEEE Transactions on Computers, 44(3):419-431, March 1995.

Keywords: disk striping, disk array, RAID, parallel I/O, pario-bib

merriam:triangle:
Marshal L. Merriam. Parallel implementation of an algorithm for Delaunay triangulation. In Proceedings of Computational Fluid Dynamics, volume 2, pages 907-912, 1992.

Keywords: parallel I/O, file system workload, pario-bib

Comment: This application runs on the NASA Ames iPSC/860 and has some I/O: reading in the input file, which is a set of x,y,z data points. I/O was really slow if formatted (i.e., ASCII instead of binary) or sequential instead of parallel. Any input record could go to any processor; the first step in the algorithm (after the points are read in) is essentially a kind of sort to move points around to localize points and balance load.

messerli:jimage:
Vincent Messerli, Oscar Figueiredo, B. Gennart, and Roger D. Hersch. Parallelizing I/O-intensive image access and processing applications. IEEE Concurrency, 7(2):28-37, 1999.

Abstract: This article presents methods and tools for building parallel applications based on commodity components: PCs, SCSI disks, Fast Ethernet, Windows NT. Chief among these tools is CAP, our computer-aided parallelization tool. CAP generates highly pipelined applications that run communication and I/O operations in parallel with processing operations. One of CAP's successes is the Visible Human Slice Server, a 3D tomographic image server that allows clients to choose and view any cross section of the human body.

Keywords: applications, image processing, pario-app, parallel I/O, pario-bib

messerli:thesis:
Vincent Messerli. Tools for Parallel I/O and Compute Intensive Applications. PhD thesis, École Polytechnique Fédérale de Lausanne, 1999. Thèse 1915.

Keywords: parallel computing, image processing, parallel I/O application, parallel I/O, pario-bib

Comment: The complete description of PS$^2$ and its use with CAP, a parallelization tool, for data-flow-like support of parallel I/O. Nice work. See also messerli:jimage, gennart:CAP, vetsch:visiblehuman, messerli:tomographic.

messerli:tomographic:
V. Messerli, B. Gennart, and R. D. Hersch. Performances of the PS$^2$ parallel storage and processing system for tomographic image visualization. In Proceedings of the Seventeenth International Conference on Distributed Computer Systems, pages 514-522, Seoul, Korea, December 1997. IEEE Computer Society Press.

Abstract: We propose a new approach for developing parallel I/O- and compute-intensive applications. At a high level of abstraction, a macro data flow description describes how processing and disk access operations are combined. This high-level description (CAP) is precompiled into compilable and executable C++ source language. Parallel file system components specified by CAP are offered as reusable CAP operations. Low-level parallel file system components can, thanks to the CAP formalism, be combined with processing operations in order to yield efficient pipelined parallel I/O and compute intensive programs. The underlying parallel system is based on commodity components (PentiumPro processors, Fast Ethernet) and runs on top of WindowsNT. The CAP-based parallel program development approach is applied to the development of an I/O and processing intensive tomographic 3D image visualization application. Configurations range from a single PentiumPro 1-disk system to a four PentiumPro 27-disk system. We show that performances scale well when increasing the number of processors and disks. With the largest configuration, the system is able to extract in parallel and project into the display space between three and four 512x512 images per second. The images may have any orientation and are extracted from a 100 MByte 3D tomographic image striped over the available set of disks.

Keywords: parallel computing, parallel I/O, parallel I/O application, image processing, pario-bib

Comment: See also messerli:jimage, gennart:CAP, vetsch:visiblehuman, messerli:thesis.

michael:future:
Gavin Michael and Andrew Chien. Future multicomputers: Beyond minimalist multiprocessors? Computer Architecture News, 20(5):6-12, December 1992.

Keywords: multiprocessor architecture, compiler, parallel I/O, pario-bib

Comment: Includes some comments by Randy Katz about parallel I/O, in particular, distinguishing between ``fat'' nodes (with many disks, e.g., a RAID), and ``thin'' nodes (with one disk).

milenkovic:model:
Milan Milenković. A model for multiprocessor I/O. Technical Report 89-CSE-30, Dept. of Computer Science and Engineering, Southern Methodist University, July 1989.

Keywords: multiprocessor I/O, I/O architecture, distributed system, pario-bib

Comment: Advocates using dedicated server processors for all I/O, e.g., disk server, terminal server, network server. Pass I/O requests and data via messages or RPC calls over the interconnect (here a shared bus). The server handles packaging, blocking, caching, errors, interrupts, and so forth, freeing the main processors and the interconnect from all this activity. Benefits: encapsulates I/O-related stuff in specific places, accommodates heterogeneity, improves performance. Nice idea, but allows for an I/O bottleneck unless the server can handle all the demand. Otherwise it would need multiple servers, more expensive than just multiple controllers.

miller:iobehave:
Ethan L. Miller and Randy H. Katz. Input/output behavior of supercomputer applications. In Proceedings of Supercomputing '91, pages 567-576, Albuquerque, NM, November 1991. IEEE Computer Society Press.

Keywords: file access pattern, supercomputer, disk caching, prefetching, pario-bib

Comment: Same as miller:iobehave-tr except without the appendix outlining the trace format. Included in the pario bibliography not because it measures a parallel workload, but because it is so often cited in the parallel-I/O community.

miller:jrama:
Ethan L. Miller and Randy H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, 23(4-5):419-446, June 1997.
See also earlier version miller:rama2.

Abstract: Modern massively parallel file systems provide high bandwidth file access by striping files across arrays of disks attached to a few specialized I/O nodes. However, these file systems are hard to use and difficult to integrate with workstations and tertiary storage. RAMA addresses these problems by providing a high-performance massively parallel file system with a simple interface. RAMA uses hashing to pseudo-randomly distribute data to all of its disks, insuring high bandwidth regardless of access pattern and eliminating bottlenecks in file block accesses. This flexibility does not cause a large loss of performance - RAMA's simulated performance is within 10-15% of the optimum performance of a similarly-sized striped file system, and is a factor of 4 or more better than a striped file system with poorly laid out data.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: They use parallel disks of a multiprocessor as a set-associative cache for tertiary storage. Each "disk line" contains a set of blocks, and a little index that lists the blocks contained in the disk line. To access block b of a file, you hash on b/s, where s is a small factor like 4; that encourages consecutive blocks to land in the same disk line, for better locality. That gives you the disk line number. From that you compute the disk number, and the node number. Send a message to that node. It reads through the index for that disk line to find the block within the line. Metadata like file permissions are stored in the disk line with the first block of the file. Part of the paper deals with file-system integrity; no fsck is needed. When RAMA goes to tertiary storage, it reads a large batch of the file, but need not read the entire file into disk cache. Dirty data are flushed back to tertiary store periodically.

They use simulation to study performance with synthetic access patterns. Unfortunately they simulated rather small files and patterns. The paper talks quite a bit about disk (space and bandwidth) utilization, and network bandwidth utilization. One of the big benefits of this hash-based approach is that it tends to distribute the traffic to the network and to the disks very evenly, even under highly regular access patterns that might unbalance a traditional striped approach. Finally, they claim to do well on small-file workloads as well as supercomputer workloads.
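
A minimal sketch of the lookup as the comment describes it (the layout constants, hash choice, and names are my assumptions, not RAMA's code):

    import zlib

    S = 4                 # sequentiality factor: blocks b..b+S-1 map together
    NDISKS = 64           # assumed layout constants, purely illustrative
    LINES_PER_DISK = 512

    def locate(file_id, block):
        key = f"{file_id}:{block // S}".encode()
        line = zlib.crc32(key) % (NDISKS * LINES_PER_DISK)  # global line no.
        disk = line // LINES_PER_DISK        # which disk, and hence which node
        return disk, line % LINES_PER_DISK   # that node then scans the line's
                                             # small index to find the block

    # blocks 0-3 land in one disk line; block 4 hashes independently
    print(locate("fileA", 0), locate("fileA", 3), locate("fileA", 4))

Because placement is a pure function of (file id, block number), no central metadata lookup is needed, which is where the even traffic distribution comes from.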

miller:pario:
L. L. Miller and A. R. Hurson. Multiprogramming and concurrency in parallel file environments. International Journal of Mini and Microcomputers, 13(2):37-45, 1991.

Keywords: parallel file system, parallel I/O, database, pario-bib

Comment: This is really for databases. They identify two types of file access: one where the file can be operated on as a set of subfiles, each independently by a processor (what they call MIMD mode), and another where the file must be operated on with centralized control (SIMD mode), in their case to search a B-tree whose nodes span the set of processors. Basically it is a host connected to a controller, which is connected to a set of small I/O processors, each of which has access to disk. In many ways a uniprocessor perspective. Paper design, with simulation results.

miller:pass:
L. L. Miller, S. R. Inglett, and A. R. Hurson. PASS: a multiuser parallel file system based on microcomputers. Journal of Systems and Software, 19(1):75-83, September 1992.

Abstract: Data intensive computer applications suffer from inadequate use of parallelism for processing data stored on secondary storage devices. Devices such as database machines are useful in some applications, but many applications are too small or specialized to use such technology. To bridge this gap, the authors introduce the parallel secondary storage (PASS) system. PASS is based on a network of microcomputers. The individual microcomputers are assigned to a unit of secondary storage and the operations of the microcomputers are initiated and monitored by a control processor. The file system is capable of acting as either an SIMD or an MIMD machine. Communication between the individual microcomputers and the control processor is described. The integration of the multiple microcomputers into the primitive operations on a file is examined. Finally, the strategies employed to enhance performance in the multiprogramming environment are discussed.

Keywords: parallel I/O, parallel file system, multiprocessor file system, pario-bib

miller:pfs:
L. L. Miller and S. R. Inglett. Enhancing performance in a parallel file system. Microprocessing and Microprogramming, 40(4):261-274, May 1994.

Keywords: parallel I/O, parallel file system, pario-bib

miller:radar:
Craig Miller, David G. Payne, Thanh N. Phung, Herb Siegel, and Roy Williams. Parallel processing of spaceborne imaging radar data. In Proceedings of Supercomputing '95, San Diego, CA, 1995. IEEE Computer Society Press.

Abstract: We discuss the results of a collaborative project on parallel processing of Synthetic Aperture Radar (SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the California Institute of Technology (Caltech) and Intel Scalable Systems Division (SSD). Through this collaborative effort, we have successfully parallelized the most compute-intensive SAR correlator phase of the Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the Intel Paragon. We describe the data decomposition, the scalable high-performance I/O model, and the node-level optimizations which enable us to obtain efficient processing throughput. In particular, we point out an interesting double level of parallelization arising in the data decomposition which increases substantially our ability to support ``high volume'' SAR. Results are presented from this code running in parallel on the Intel Paragon. A representative set of SAR data, of size 800 Megabytes, which was collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15 seconds, is processed in 55 seconds on the Concurrent Supercomputing Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes for the current SIR-C/X-SAR processing system at JPL. For the first time, a commercial system can process SIR-C/X-SAR data at a rate which is approaching the rate at which the SIR-C/X-SAR instrument can collect the data. This work has successfully demonstrated the viability of the Intel Paragon supercomputer for processing ``high volume'' Synthetic Aperture Radar data in near real-time.

Keywords: parallel I/O, pario-bib

Comment: Available only on CD-ROM and WWW.

miller:rama:
Ethan L. Miller and Randy H. Katz. RAMA: a file system for massively-parallel computers. In Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems, pages 163-168, 1993.
See also later version miller:rama2.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: The multiprocessor's file system acts as a block cache for tertiary storage. Disk space is broken into ``lines'' of a few MB. Each line has a descriptor telling what blocks it has, and their status. (fileid, offset) hashed to find (disk, linenum). Intrinsic metadata stored at start of each file; positional metadata implicit in hashing, and line descriptors. Sequentiality parameter puts several blocks of a file in the same line, to improve medium-sized requests (otherwise generate lots of request-response net traffic). Not clear on best choice of size. No mention of atomicity wrt concurrent writes to same data. Blocks migrate to tertiary storage as they get old. Fetched on demand, by block (not file). Self-describing blocks have ids in block - leads to screwy block sizes?

miller:rama2:
Ethan L. Miller and Randy H. Katz. RAMA: Easy access to a high-bandwidth massively parallel file system. In Proceedings of the 1995 USENIX Technical Conference, pages 59-70, January 1995.
See also earlier version miller:rama.
See also later version miller:jrama.

Keywords: parallel file system, pario-bib

Comment: Simulation results. RAMA distributes blocks of each file randomly across disks, which are attached to all processor nodes, using a hash function. Thus there is no centralized metadata. The big benefit is uniform performance regardless of access pattern; they found one situation where it was 10% slower than an optimal striped layout, but many cases where they were as much as 4 times faster than bad striped data layouts. So, they can give reasonable performance without the need for programmer- or manager-specified data layouts.

milligan:bifs:
P. Milligan, L. C. Waring, and A. S. C. Lee. BIFS: A filing system for multiprocessor based systems. Microprocessing and Microprogramming, 31:9-12, 1991. Euromicro '90 conference, Amsterdam.

Keywords: multiprocessor file system, pario-bib

Comment: A simple file system for a transputer network, attached to a single disk device. Several procs are devoted to the file system, but really just act as buffers for the host processor that runs the disk. They provide sequential, random access, and indexed files, either byte- or record-oriented. Some prototypes; no results. They add buffering and double buffering, but don't really get into anything interesting.

miya.biblio:
Eugene N. Miya. Multiprocessor/distributed processing bibliography. Computer Architecture News, 13(1):27-29, March 1985. Much updated since then, now kept on-line.

Keywords: bibliography, parallel computing, distributed computing, pario-bib

Comment: This reference is the original publication of Eugene's annotated bibliography. It has grown tremendously and is now huge. Because of the copyright considerations, you can't just nab it off the net, but it is free for the asking from Eugene. Send mail to eugene@nas.nasa.gov.

miyamura:adventure-io:
Tomoshi Miyamura and Shinobu Yoshimura. Generalized I/O data format and interface library for module-based parallel finite element analysis system. Advances in Engineering Software, 35(3-4):149-159, March 2004.

Abstract: In this paper, a generalized input/output (I/O) data format and library for a module-based parallel finite element analysis system are proposed. The module-based system consists of pre-, main- and post-modules, as well as some common libraries. The present I/O library, called ADVENTURE_IO, and data format are developed specifically for use in parallel high-performance computational mechanics system. These are rather simple compared to other general-purpose I/O systems such as netCDF and HDF5. A simple container called a finite element generic attributes (FEGAs) document enables the handling of almost all the I/O data in a parallel finite element method code. Due to the simplicity of the present system, tuning up the I/O library for a specific parallel environment is easy. Other major features of the present system are: (1) it possesses a generalized collaboration mechanism consisting of multiple modules in a distributed computing environment employing common object request broker architecture, and (2) abstracted data description employed in the FEGA/HDDM_FEGA document enables the development of a unique domain decomposer that can subdivide any kind of input data.

Keywords: data format, finite element method, generalized I/O data, hierarchical domain decomposition, pario-app, pario-bib

mogi:parity:
Kazuhiko Mogi and Masaru Kitsuregawa. Dynamic parity stripe reorganizations for RAID5 disk arrays. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems, pages 17-26, September 1994.

Abstract: RAID5 disk arrays provide high performance and high reliability for reasonable cost. However, RAID5 suffers a performance penalty during block updates. We examine the feasibility of using "dynamic parity striping" to improve the performance of block updates. Instead of updating each block independently, this method buffers a number of updates, generates a new stripe composed of the newly updated blocks, then writes the full stripe back to disk. Two implementations are considered in this paper. One is a log-structured file system (LFS) based method and the other is Virtual Striping. Both methods achieve much higher performance than conventional approaches. The performance characteristics of the LFS based method and the Virtual Striping method are clarified.

Keywords: disk array, RAID, disk striping, parallel I/O, pario-bib
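
The core idea reduces to a little code: buffer updated blocks until a full stripe's worth accumulates, compute parity over the buffer, and write the whole stripe at once, avoiding the RAID5 read-modify-write penalty. A minimal C sketch, with invented names and sizes (a real implementation, as in LFS, must also remap each logical block to its new on-disk location):

    #include <stdio.h>
    #include <string.h>

    #define NDATA 4       /* data blocks per stripe (e.g., 5-disk RAID5) */
    #define BLOCK 4096

    static unsigned char buf[NDATA][BLOCK];  /* pending updated blocks */
    static int nbuf = 0;

    static void write_full_stripe(void)
    {
        unsigned char parity[BLOCK] = {0};
        for (int i = 0; i < NDATA; i++)
            for (int j = 0; j < BLOCK; j++)
                parity[j] ^= buf[i][j];
        /* here: issue NDATA+1 block writes to a fresh stripe; no
           old-data or old-parity reads are needed */
        printf("wrote stripe: %d data blocks + parity\n", NDATA);
        nbuf = 0;
    }

    static void update_block(const unsigned char *data)
    {
        memcpy(buf[nbuf++], data, BLOCK);
        if (nbuf == NDATA)
            write_full_stripe();
    }

    int main(void)
    {
        unsigned char block[BLOCK] = {0};
        for (int i = 0; i < NDATA; i++) {
            block[0] = (unsigned char)i;
            update_block(block);
        }
        return 0;
    }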

mokhoff:pario:
Nicholas Mokhoff. Parallel disk assembly packs 1.5 GBytes, runs at 4 MBytes/s. Electronic Design, pages 45-46, November 1987.

Keywords: parallel I/O, I/O, disk architecture, disk striping, reliability, pario-bib

Comment: Commercially available: Micropolis Systems' Parallel Disk 1800 series. Four disks plus one parity disk, synchronized and byte-interleaved. SCSI interface. Total capacity 1.5 GBytes, sustained transfer rate of 4 MBytes/s. MTTF 140,000 hours. Hard and soft errors corrected in real-time. Failed drives can be replaced while system is running.

molero:modeling:
Xavier Molero, Federico Silla, Vicente Santonja, and José Duato. Modeling and evaluation of Fibre Channel storage area networks. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 31, pages 464-473. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: storage area network, pario-bib

Comment: Part of jin:io-book.

montague:swift:
Bruce R. Montague. The Swift/RAID distributed transaction driver. Technical Report UCSC-CRL-93-99, UC Santa Cruz, January 1993.

Keywords: RAID, parallel I/O, distributed file system, transaction, pario-bib

Comment: See other Swift papers, e.g., cabrera:pario and long:swift-raid. This paper describes the basic idea of using a transaction driver to implement RAID over a distributed system, then spends most of its time describing the details of the implementation. The basic idea is that processors execute transaction drivers, which provide virtual CPUs to execute scripts of atomic 'instructions', where the instructions are high-level things like read block, write block, compute parity, etc. The transaction driver multiplexes several scripts if necessary. (Although they describe it in the context of a RAID implementation, it certainly could be used for other complex distributed services.) The instructions are often transaction pairs, which compile into a pair of instructions, one for this node and one for the remote node. This node sends the program to the remote node, and they execute them separately, keeping synchronized for transaction pairs when necessary. See also the newer paper in Computing Surveys, long:swift-raid.
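
A toy rendering of the script idea may help; the opcode set and the sample script below are invented for illustration, not taken from the paper.

    #include <stdio.h>

    /* Hypothetical opcodes modeled on the paper's description: scripts of
       high-level atomic 'instructions' interpreted by a transaction driver. */
    typedef enum { READ_BLOCK, WRITE_BLOCK, XOR_PARITY, SEND_REMOTE, HALT } op_t;
    typedef struct { op_t op; int arg; } insn_t;

    static void run_script(const insn_t *pc)
    {
        for (; pc->op != HALT; pc++) {
            switch (pc->op) {
            case READ_BLOCK:  printf("read block %d\n", pc->arg);  break;
            case WRITE_BLOCK: printf("write block %d\n", pc->arg); break;
            case XOR_PARITY:  printf("xor into parity buffer\n");  break;
            case SEND_REMOTE: printf("ship paired half of transaction "
                                     "to node %d\n", pc->arg);     break;
            default: break;
            }
        }
    }

    int main(void)
    {
        /* one small-write as a script: read old data, fold into parity,
           ship the remote half of the pair, write the new data */
        insn_t small_write[] = {
            { READ_BLOCK, 7 }, { XOR_PARITY, 0 },
            { SEND_REMOTE, 3 }, { WRITE_BLOCK, 7 }, { HALT, 0 }
        };
        run_script(small_write);
        return 0;
    }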

moon:declustering:
Bongki Moon and Joel H. Saltz. Scalability analysis of declustering methods for multidimensional range queries. IEEE Transactions on Knowledge and Data Engineering, 1997. To appear.

Abstract: Efficient storage and retrieval of multi-attribute datasets have become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses. Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods Disk Modulo and Fieldwise Xor for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods and are corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.

Keywords: parallel I/O, parallel database, declustering, pario-bib

moore:ddio:
Jason A. Moore and Michael J. Quinn. Enhancing disk-directed I/O for fine-grained redistribution of file data. Parallel Computing, 23(4):477-499, June 1997.

Keywords: parallel I/O, multiprocessor file system, interprocessor communication, pario-bib

Comment: They propose several enhancements to disk-directed I/O (see kotz:diskdir) that aim to improve performance on fine-grained distributions, that is, where each block from the disk is broken into small pieces that are scattered among the compute processors. One enhancement combines multiple pieces, possibly from separate disk blocks, into a single message. Another is to use two-phase I/O (see delrosario:two-phase), but to use disk-directed I/O to read data from the disks into CP memories efficiently, then permute. The latter technique is probably faster than normal two-phase I/O, which uses a traditional file system rather than disk-directed I/O for the read phase.
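
The message-combining enhancement amounts to this: rather than one small message per piece, gather every piece bound for the same compute processor (possibly from several disk blocks) into one message. A C sketch with invented data structures:

    #include <stdio.h>

    #define NCP 4   /* compute processors */

    typedef struct { int cp; long file_off; int len; } piece_t;

    /* Combine pieces per destination CP and send one message each. */
    static void flush_combined(const piece_t *p, int n)
    {
        for (int cp = 0; cp < NCP; cp++) {
            int bytes = 0, count = 0;
            for (int i = 0; i < n; i++)
                if (p[i].cp == cp) { bytes += p[i].len; count++; }
            if (count > 0)
                printf("CP %d: one message carrying %d pieces, %d bytes\n",
                       cp, count, bytes);
        }
    }

    int main(void)
    {
        /* fine-grained distribution: each 4KB disk block splits into
           small pieces scattered among the CPs */
        piece_t p[] = { {0, 0, 512}, {1, 512, 512},
                        {0, 4096, 512}, {1, 4608, 512} };
        flush_combined(p, 4);
        return 0;
    }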

moore:detection:
Jason A. Moore, Philip J. Hatcher, and Michael J. Quinn. Efficient data-parallel files via automatic mode detection. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 1-14, Philadelphia, May 1996. ACM Press.

Abstract: Parallel languages rarely specify parallel I/O constructs, and existing commercial systems provide the programmer with a low-level I/O interface. We present design principles for integrating I/O into languages and show how these principles are applied to a virtual-processor-oriented language. We illustrate how machine-independent modes are used to support both high performance and generality. We describe an automatic mode detection technique that saves the programmer from extra syntax and low-level file system details. We show how virtual processor file operations, typically small by themselves, are combined into efficient large-scale file system calls. Finally, we present a variety of benchmark results detailing design tradeoffs and the performance of various modes.

Keywords: parallel I/O, data parallelism, pario-bib

Comment: Updated version of TR 95-80-9. See moore:stream. Interesting approach, where they permit a fairly normal fread and fwrite kind of interface, with each VP having its own stream. They choose their own format for the file, and switch between formats (and internal buffering) depending on the particulars of the fread and fwrite parameters. They seem to have good performance, and a familiar interface. They are left with a non-standard file format.

moore:ocean:
Jason A. Moore. Parallel I/O requirements of four oceanography applications. Technical Report 95-80-1, Oregon State University, January 1995.

Abstract: Brief descriptions of the I/O requirements for four production oceanography programs running at Oregon State University are presented. The applications all rely exclusively on array-oriented, sequential file operations. Persistent files are used for checkpointing and movie making, while temporary files are used to store out-of-core data.

Keywords: data parallel, file system workload, parallel I/O, pario-bib

Comment: See moore:detection, moore:stream. Only three pages.

moore:stream:
Jason A. Moore, Philip J. Hatcher, and Michael J. Quinn. Stream*: Fast, flexible, data-parallel I/O. In Parallel Computing: State-of-the-Art and Perspectives (ParCo '95), pages 287-294. Elsevier Science, September 1995.
See also earlier version moore:stream-tr.

Keywords: data parallel, parallel I/O, pario-bib

moore:stream-tr:
Jason A. Moore, Philip J. Hatcher, and Michael J. Quinn. Stream*: Fast, flexible, data-parallel I/O. Technical Report 94-80-13, Oregon State University, 1994. Updated September 1995.
See also later version moore:stream.

Abstract: Although hardware supporting parallel file I/O has improved greatly since the introduction of first-generation parallel computers, the programming interface has not. Each vendor provides a different logical view of parallel files as well as nonportable operations for manipulating files. Neither do parallel languages provide standards for performing I/O. In this paper, we describe a view of parallel files for data-parallel languages, dubbed Stream*, in which each virtual processor writes to and reads from its own stream. In this scheme each virtual processor's I/O operations have the same familiar, unambiguous meaning as in a sequential C program. We demonstrate how I/O operations in Stream* can run as fast as those of vendor-specific parallel file systems on the operations most often encountered in data-parallel programs. We show how this system supports general virtual processor operations for debugging and elemental functions. Finally, we present empirical results from a prototype Stream* system running on a Meiko CS-2 multicomputer.

Keywords: data parallel, parallel I/O, pario-bib

Comment: See moore:stream; nearly identical. See also moore:detection. This paper gives a slightly earlier description of the Stream* idea than does moore:detection, but you'd get essentially the full picture just by reading moore:detection.

moran:imad:
David Moran, Gary Ditlow, Daria Dooling, Ralph Williams, and Tom Wilkins. Integrated manufacturing and design. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: manufacturing, integrated chip, parallel I/O, pario-bib

Comment: They describe "IMaD", a parallel code used to support product engineering of full-scale integrated circuits. The code simulates the entire integrated circuit to address three primary aspects of product engineering: to assure that an IC is manufacturable, to monitor its lifetime yield and reliability, and to support IC test and failure analysis. The simulation is computationally, memory, and I/O intensive. While the paper primarily describes the model and the simulation equations, the talk addressed the issue of parallel I/O, where the data for each processor was written to a separate disk. Not exactly a novel approach, but it emphasizes the fact that the I/O requirements are large enough that they used an approach other than a standard serial method.

more:mtio:
Sachin More, Alok Choudhary, Ian Foster, and Ming Q. Xu. MTIO: a multi-threaded parallel I/O system. In Proceedings of the Eleventh International Parallel Processing Symposium, pages 368-373, April 1997.

Abstract: This paper presents the design and evaluation of a multi-threaded runtime library for parallel I/O. We extend the multi-threading concept to separate the compute and I/O tasks in two separate threads of control. Multi-threading in our design permits a) asynchronous I/O even if the underlying file system does not support asynchronous I/O; b) copy avoidance from the I/O thread to the compute thread by sharing address space; and c) a capability to perform collective I/O asynchronously without blocking the compute threads. Further, this paper presents techniques for collective I/O which maximize load balance and concurrency while reducing communication overhead in an integrated fashion. Performance results on IBM SP2 for various data distributions and access patterns are presented. The results show that there is a tradeoff between the amount of concurrency in I/O and the buffer size designated for I/O; and there is an optimal buffer size beyond which benefits of larger requests diminish due to large communication overheads.

Keywords: threads, parallel I/O, pario-bib

moren:controllers:
William D. Moren. Design of controllers is key element in disk subsystem throughput. Computer Technology Review, pages 71-73, Spring 1988.

Keywords: parallel I/O, disk architecture, pario-bib

Comment: A short paper on some basic techniques used by disk controllers to improve throughput: seek optimization, request combining, request queuing, using multiple drives in parallel, scatter/gather DMA, data caching, read-ahead, cross-track read-ahead, write-back caching, segmented caching, reduced latency (track buffering), and format skewing. [Most of these are already handled in Unix file systems.]

mourad:raid:
Antoine N. Mourad, W. Kent Fuchs, and Daniel G. Saab. Performance of redundant disk array organizations in transaction processing environments. In Proceedings of the 1993 International Conference on Parallel Processing, pages I-138-145, St. Charles, IL, 1993. CRC Press.

Keywords: parallel I/O, disk array, pario-bib, RAID

Comment: Transaction-processing workload dominated by small I/Os. They compare RAID 5, Parity Striping (which was designed for TP because it avoids lots of seeks on medium-sized requests, by declustering parity but not data), mirroring, and RAID 0. RAID 5 does better than parity striping due to its load-balancing ability on the skewed workload. RAID 5 also does better as the load increases.

mowry:prefetch:
Todd C. Mowry, Angela K. Demke, and Orran Krieger. Automatic compiler-inserted I/O prefetching for out-of-core applications. In Proceedings of the 1996 Symposium on Operating Systems Design and Implementation, pages 3-17. USENIX Association, October 1996.
See also later version mowry:jprefetch.

Abstract: Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve ``out-of-core'' problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and release hints for managing I/O, and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively move prefetches back ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We have implemented our scheme using the SUIF compiler and the Hurricane operating system. Our experimental results demonstrate that our fully-automatic scheme effectively hides the I/O latency in out-of-core versions of the entire NAS Parallel benchmark suite, thus resulting in speedups of roughly twofold for five of the eight applications, with two applications speeding up by threefold or more.

Keywords: compiler, prefetch, parallel I/O, pario-bib

Comment: Best Paper Award.

moyer:application:
S. Moyer and V. S. Sunderam. Parallel I/O as a parallel application. International Journal of Supercomputer Applications, 9(2):95-107, Summer 1995.

Keywords: parallel I/O, pario-bib

Comment: An overview of PIOUS and its performance. Results for partitioned and self-scheduled access patterns. See other moyer:* papers. The big thing about PIOUS over previous parallel file systems is its internal use of transactions for concurrency control and user-selectable fault-tolerance guarantees, and its optional support of user-level transactions.

moyer:characterize:
Steven A. Moyer and V. S. Sunderam. Characterizing concurrency control performance for the PIOUS parallel file system. Technical Report CSTR-950601, Emory University, June 1995.
See also later version moyer:jcharacterize.

Abstract: Parallel file systems employ data declustering to increase I/O throughput. But because a single read or write operation can generate data accesses on multiple independent storage devices, a concurrency control mechanism must be employed to retain familiar file access semantics. Concurrency control negates some of the performance benefits of data declustering by introducing additional file access overhead. This paper examines the performance characteristics of the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Results demonstrate that linearizability of file access operations is provided without loss of scalability or stability.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: ``substantially different material than presented in a previous report,'' moyer:scalable-tr. But it seems like the moyer:scalable IOPADS paper is largely a subset of this TR. He describes how they use volatile transactions, and does some experiments with PIOUS to measure their efficiency. Basically, they use a 2-phase commit protocol, using timeouts to detect deadlock and transaction aborts to remedy the deadlock. Results for partitioned and sequential access patterns.
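
The timeout-based scheme can be sketched in C as follows; the function names and the canned lock outcome are placeholders, not PIOUS's actual interface.

    #include <stdio.h>
    #include <stdbool.h>

    /* Stand-in for a timed lock request to the data server holding a
       segment; a real system would block for up to timeout_ms. */
    static bool try_lock(int seg, int timeout_ms)
    {
        (void)timeout_ms;
        return seg % 2 == 0;   /* canned outcome for the demo */
    }

    /* Acquire all locks; on any timeout, presume deadlock, release
       everything, and let the caller retry -- no deadlock detector. */
    static bool run_transaction(const int *segs, int n)
    {
        for (int i = 0; i < n; i++) {
            if (!try_lock(segs[i], 100)) {
                for (int j = 0; j < i; j++)
                    printf("abort: unlock segment %d\n", segs[j]);
                return false;
            }
        }
        printf("all locks held: do the I/O, then two-phase commit\n");
        return true;
    }

    int main(void)
    {
        int segs[] = { 0, 2, 4 };
        while (!run_transaction(segs, 3))
            ;   /* retry after abort, typically with backoff */
        return 0;
    }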

moyer:jcharacterize:
Steven A. Moyer and V.S. Sunderam. Characterizing concurrency control performance for the PIOUS parallel file system. Journal of Parallel and Distributed Computing, 38(1):81-91, October 1996.
See also earlier version moyer:characterize.

Keywords: parallel I/O, multiprocessor file system, pario-bib

moyer:pario:
Steven A. Moyer and V. S. Sunderam. A parallel I/O system for high-performance distributed computing. In Proceedings of the IFIP WG10.3 Working Conference on Programming Environments for Massively Parallel Distributed Systems, 1994.

Keywords: parallel I/O, parallel file system, workstation cluster, file system interface, pario-bib

Comment: See moyer:pious. A further description of the PIOUS parallel file system for cluster computing. (Beta-test version available for ftp). They support parafiles, which are collections of segments, each segment residing on a different server. The segments can be viewed separately or can be interleaved into a linear sequence using an arbitrary chunk size. They also provide transactions to support sequential consistency.

moyer:pious:
Steven A. Moyer and V. S. Sunderam. PIOUS: a scalable parallel I/O system for distributed computing environments. In Proceedings of the Scalable High-Performance Computing Conference, pages 71-78, 1994.

Keywords: parallel I/O, parallel file system, workstation cluster, file system interface, pario-bib

Comment: Basically, I/O for clusters of workstations; ideally, it is parallel, heterogeneous, fault tolerant, etc. File servers are independent, have only a local view. Single server used to coordinate open(). Client libraries implement the API and depend on the servers only for storage mechanism. Servers use transactions internally - but usually these are lightweight transactions, only used for concurrency control and not recovery. Full transactions are supported for times when the user wants the extra fault tolerance. They have files that are in some sense 2-dimensional. Sequential consistency. User-controllable fault tolerance. Performance: 2 clients max out the transport (ethernet). ``Stable'' mode is slow, as is self-scheduled mode. No client caching. See moyer:pario.
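
One plausible way to interleave a parafile's segments into a linear view, sketched in C; this mapping is an assumption for illustration, not necessarily PIOUS's documented layout.

    #include <stdio.h>

    /* Map a global offset in the linear view onto (segment, local offset),
       dealing fixed-size chunks round-robin across S segments. */
    static void map_offset(long x, long chunksz, long S,
                           long *seg, long *local)
    {
        long chunk = x / chunksz;            /* which chunk of the view */
        *seg   = chunk % S;                  /* round-robin across segments */
        *local = (chunk / S) * chunksz + x % chunksz;
    }

    int main(void)
    {
        long seg, local;
        for (long x = 0; x < 6 * 1024; x += 1024) {
            map_offset(x, 1024, 3, &seg, &local);
            printf("global %5ld -> segment %ld, local %5ld\n", x, seg, local);
        }
        return 0;
    }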

moyer:scalable:
Steven A. Moyer and V. S. Sunderam. Scalable concurrency control for parallel file systems. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 90-106, April 1995.
See also earlier version moyer:scalable-tr.
See also later version moyer:scalable-book.

Abstract: Parallel file systems employ data declustering to increase I/O throughput. As a result, a single read or write operation can generate concurrent data accesses on multiple storage devices. Unless a concurrency control mechanism is employed, familiar file access semantics are likely to be violated. This paper details the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Performance results are presented demonstrating that sequential consistency semantics can be provided without loss of system scalability.

Keywords: parallel I/O, pario-bib

Comment: Seems to be a subset of moyer:scalable-tr, and for that matter, moyer:characterize. Results for partitioned access pattern.

moyer:scalable-book:
Steven A. Moyer and V. S. Sunderam. Scalable concurrency control for parallel file systems. In Jain et al. [iopads-book], chapter 10, pages 225-243.
See also earlier version moyer:scalable.

Abstract: Parallel file systems employ data declustering to increase I/O throughput. As a result, a single read or write operation can generate concurrent data accesses on multiple storage devices. Unless a concurrency control mechanism is employed, familiar file access semantics are likely to be violated. This paper details the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Performance results are presented demonstrating that sequential consistency semantics can be provided without loss of system scalability.

Keywords: parallel I/O, parallel file system, concurrency control, synchronization, transaction, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

moyer:scalable-tr:
Steven A. Moyer and V. S. Sunderam. Scalable concurrency control for parallel file systems. Technical Report CSTR-950202, Emory University, February 1995.
See also later version moyer:scalable.

Abstract: Parallel file systems employ data declustering to increase I/O throughput. As a result, a single read or write operation can generate concurrent data accesses on multiple storage devices. Unless a concurrency control mechanism is employed, familiar file access semantics are likely to be violated. This paper details the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Performance results are presented demonstrating that sequential consistency semantics can be provided without loss of system scalability.

Keywords: parallel I/O, parallel file system, pario-bib

Comment: They describe volatile transactions as a way of providing the appropriate sequential consistency among file-read and -write operations (a feature not provided by most file systems). Their PIOUS library implements these transactions with strict 2-phase locking. They show some performance results, though only on a limited and relatively simple benchmark. If nothing else this paper reminds us all that atomicity of file-read and -write requests should be available to the user (e.g., note how they are optional in Vesta). Published as moyer:scalable.

mpi-forum:mpi2:
MPI-2: Extensions to the message-passing interface. The MPI Forum, July 1997.
See also earlier version mpi-ioc:mpi-io5.

Keywords: parallel I/O, message-passing, multiprocessor file system interface, pario-bib

Comment: This is the definition of the MPI2 message-passing standard, which includes an interface for parallel I/O. Supersedes mpi-ioc:mpi-io5 and earlier versions. See the MPI2 web page at http://www.mpi-forum.org. The I/O section is at http://www.mpi-forum.org/docs/mpi-20-html/node172.html.

mpi-ioc:mpi-io5:
MPI-IO: a parallel file I/O interface for MPI. The MPI-IO Committee, April 1996. Version 0.5.
See also earlier version corbett:mpi-io4.
See also later version mpi-forum:mpi2.

Keywords: parallel I/O, message-passing, multiprocessor file system interface, pario-bib

Comment: Supersedes corbett:mpi-io4 and earlier versions. See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.

mpi2-io:
Message-Passing Interface Forum. MPI-2.0: Extensions to the Message-Passing Interface, chapter 9. MPI Forum, June 1997.

Keywords: MPI, message passing, parallel computing, library, parallel I/O, pario-bib

Comment: Chapter 9 is about I/O extensions.

mueck:multikey:
T. A. Mueck and J. Witzmann. Multikey index support for tuple sets on parallel mass storage systems. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 136-145, September 1995.

Abstract: The development and evaluation of a tuple set manager (TSM) based on multikey index data structures is a main part of the PARABASE project at the University of Vienna. The TSM provides access to parallel mass storage systems using tuple sets instead of conventional files as the central data structure for application programs. A proof-of-concept prototype TSM is already implemented and operational on an iPSC/2. It supports tuple insert and delete operations as well as exact match, partial match, and range queries at system call level. Available results are from this prototype on the one hand and from various performance evaluation figures. The evaluation results demonstrate the performance gain achieved by the implementation of the tuple set management concept on a parallel mass storage system.

Keywords: parallel database, mass storage, parallel I/O, pario-bib

muller:multi:
Keith Muller and Joseph Pasquale. A high performance multi-structured file system design. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 56-67, Pacific Grove, CA, 1991. ACM Press.

Keywords: file system, disk striping, disk mirroring, pario-bib

muntz:failure:
Richard R. Muntz and John C. S. Lui. Performance analysis of disk arrays under failure. In Proceedings of the 16th International Conference on Very Large Data Bases, pages 162-173, 1990.

Keywords: disk array, parallel, performance analysis, pario-bib

Comment: Looked at RAID5 in failure mode. For a small-reads workload, the array could deliver only 50% of its normal throughput. So they decouple cluster size and parity-group size, declustering over more disks than the group size; during failure, this causes less of a load increase on the surviving disks.
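
The load argument can be made quantitative with the usual declustered-parity ratio (the standard analysis, not necessarily the paper's exact formulation): with parity-group size G declustered over a cluster of C disks, each surviving disk sees extra reconstruction read load of roughly

    \alpha = \frac{G - 1}{C - 1}
    % RAID5 (G = C) gives \alpha = 1: survivors absorb ~100% extra read
    % load, matching the 50%-of-normal throughput noted above. Declustering
    % with, say, G = 5 over C = 9 disks gives \alpha = 0.5.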

muntz:intro:
Richard R. Muntz and Leana Golubchik. Parallel data servers and applications. Parallel Computing, 24(1):1-4, January 1998.

Keywords: parallel I/O, multimedia, databases, pario-bib

Comment: Introduction to a special issue.

mutisya:cache:
Gerald Mutisya and Bradley M. Broom. Distributed file caching for the AP1000. In Proceedings of the Third Fujitsu-ANU CAP Workshop, November 1992.

Keywords: distributed file system, multiprocessor file system, pario-bib

Comment: See also broom:acacia, broom:impl, lautenbach:pfs, and broom:cap. They examine ways to manage a distributed file cache, without replication. Since there is no replication, the concurrency-control problems boil down to providing atomicity for multi-block, multi-site requests. This is handled essentially by serializing the request: send the request to the first site, and have it forward the request from site to site as each block is processed. This works but completely serializes each multi-block request, somewhat defeating the purpose. Thus, by having multiple servers they get concurrency between requests, but no parallelism within a request.
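
A minimal C sketch of the forwarding chain; the round-robin placement of blocks across sites is an assumption for illustration.

    #include <stdio.h>

    #define NSITES 4

    /* The multi-block request visits each caching site in turn: a site
       services its blocks, then forwards the request onward, so the whole
       request is atomic but strictly serial. */
    static void forward_request(int first_site, int nblocks)
    {
        int site = first_site;
        for (int b = 0; b < nblocks; b++) {
            printf("site %d services block %d\n", site, b);
            site = (site + 1) % NSITES;  /* hand the request to next site */
        }
        printf("final site replies to the client\n");
    }

    int main(void)
    {
        forward_request(1, 6);
        return 0;
    }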

myllymaki:buffering:
Jussi Myllymaki and Miron Livny. Efficient buffering for concurrent disk and tape I/O. Performance Evaluation: An International Journal, 27/28:453-471, 1996. Performance '96.

Keywords: buffering, file caching, tertiary storage, tape robot, file migration, parallel I/O, pario-bib

Comment: Ways to use secondary and tertiary storage in parallel, and buffering mechanisms for applications with concurrent I/O requirements.

nagaraj:hpfs:
U. Nagaraj, U. S. Shukla, and A. Paulraj. Design and evaluation of a high performance file system for message passing parallel computers. In Proceedings of the Fifth International Parallel Processing Symposium, pages 549-554, 1991.

Keywords: multiprocessor file system, pario-bib

Comment: They describe a file system for general message-passing, distributed-memory, separate I/O and compute node, multicomputers. They provide few details, although they cite a lot of their tech reports. There are a few simulation results, but none show anything unintuitive.

nagashima:pario:
Umpei Nagashima, Takashi Shibata, Hiroshi Itoh, and Minoru Gotoh. An improvement of I/O function for auxiliary storage: Parallel I/O for a large scale supercomputing. In Proceedings of the 1990 ACM International Conference on Supercomputing, pages 48-59, 1990.

Keywords: parallel I/O, pario-bib

Comment: Using parallel I/O channels to access striped disks, in parallel from a supercomputer. They chain (i.e., combine) requests to a disk for large contiguous accesses.
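
Request chaining of this sort is easy to sketch: with requests sorted by disk offset, merge contiguous runs into one larger access. The data structures here are invented for illustration.

    #include <stdio.h>

    typedef struct { long off, len; } req_t;

    /* Merge adjacent requests in place; returns the new count. */
    static int chain(req_t *r, int n)
    {
        if (n == 0) return 0;
        int out = 0;
        for (int i = 1; i < n; i++) {
            if (r[out].off + r[out].len == r[i].off)
                r[out].len += r[i].len;  /* contiguous: extend the access */
            else
                r[++out] = r[i];         /* gap: start a new access */
        }
        return out + 1;
    }

    int main(void)
    {
        req_t r[] = { {0, 4096}, {4096, 4096}, {8192, 4096}, {65536, 4096} };
        int m = chain(r, 4);
        for (int i = 0; i < m; i++)
            printf("access at %6ld, %6ld bytes\n", r[i].off, r[i].len);
        return 0;
    }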

nakajo:ionet:
H. Nakajo, S. Ohtani, T. Matsumoto, M. Kohata, K. Hiraki, and Y. Kaneda. An I/O network architecture of the distributed shared-memory massively parallel computer JUMP-1. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 253-260. ACM Press, July 1997.

Keywords: collective I/O, multiprocessor file system, parallel I/O, pario-bib

nakajo:jump1:
Hironori Nakajo. A simulation-based evaluation of a disk I/O subsystem for a massively parallel computer: JUMP-1. In Proceedings of the Sixteenth International Conference on Distributed Computer Systems, pages 562-569. IEEE Computer Society Press, May 1996.

Abstract: JUMP-1 is a distributed shared-memory massively parallel computer and is composed of multiple clusters of interconnected network called RDT (Recursive Diagonal Torus). Each cluster in JUMP-1 consists of 4 element processors, secondary cache memories, and 2 MBP (Memory Based Processor) for high-speed synchronization and communication among clusters. The I/O subsystem is connected to a cluster via a high-speed serial link called STAFF-Link. The I/O buffer memory is mapped onto the JUMP-1 global shared-memory to permit each I/O access operation as memory access. In this paper we describe evaluation of the fundamental performance of the disk I/O subsystem using event-driven simulation, and estimated performance with a Video On Demand (VOD) application.

Keywords: parallel I/O, I/O architecture, pario-bib

nastea:optimization:
S. Nastea, V. Sgarciu, and M. Simonca. Parallel I/O performance optimization. Revue Roumaine des Sciences Techniques Serie Electrotechnique et Energetique, 45(3):487-500, 2000.

Keywords: parallel I/O, pario-bib

natarajan:clusterio:
Chita Natarajan and Ravishankar K. Iyer. Measurement and simulation based performance analysis of parallel I/O in a high-performance cluster system. In Proceedings of the 1996 IEEE Symposium on Parallel and Distributed Processing, pages 332-339. IEEE Computer Society Press, October 1996.

Abstract: This paper presents a measurement and simulation based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The measurements were used to characterize the performance bottlenecks and the throughput limits at the compute and I/O nodes, and to provide realistic input parameters to PioSim, a simulation environment we have developed to investigate parallel I/O performance issues in cluster systems. PioSim was used to obtain a detailed characterization of parallel I/O performance, in the high performance cluster system, for different regular access patterns and different system configurations. This paper also explores the use of local disks at the compute nodes for parallel I/O, and finds that the local disk architecture outperforms the traditional parallel I/O over remote I/O node disks architecture, even when as much as 68-75% of the requests from each compute node goes to remote disks.

Keywords: performance analysis, parallel I/O, pario-bib

ncr:3600:
NCR 3600 product description. Technical Report ST-2119-91, NCR, San Diego, September 1991.

Keywords: multiprocessor architecture, MIMD, parallel I/O, pario-bib

Comment: Has 1-32 50MHz Intel 486 processors. Parallel independent disks on the disk nodes, separate from the processor nodes. Tree interconnect. Aimed at database applications.

ng:diskarray:
Spencer Ng. Some design issues of disk arrays. In Proceedings of IEEE Compcon, pages 137-142, Spring 1989. San Francisco, CA.

Keywords: parallel I/O, disk array, pario-bib

Comment: Discusses disk arrays and striping. Transfer size is important to striping success: small transfers are better off with independent disks. Synchronized rotation is especially important for small transfer sizes, since then the increased rotational delays dominate. Fine-grain striping involves less assembly/disassembly delay, but coarse-grain (block) striping allows for request parallelism. Fine-grain striping wastes capacity due to fixed-size formatting overhead. He also derives an exact MTTF equation for 1-failure tolerance and on-line repair.
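
For reference, the widely used approximation that an exact derivation like Ng's refines, for a group of N disks tolerating one failure with on-line repair:

    \mathrm{MTTF}_{\mathrm{group}} \approx
        \frac{\mathrm{MTTF}_{\mathrm{disk}}^{2}}{N \, (N-1) \, \mathrm{MTTR}}
    % Example: N = 5, MTTF_disk = 140,000 h, MTTR = 24 h gives
    % 140000^2 / (5 * 4 * 24), roughly 4 x 10^7 hours.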

ng:interleave:
S. Ng, D. Lang, and R. Selinger. Trade-offs between devices and paths in achieving disk interleaving. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 196-201, 1988.

Keywords: parallel I/O, disk architecture, disk caching, I/O bottleneck, pario-bib

Comment: Compares four different ways of restructuring IBM disk controllers and channels to obtain more parallelism. They use parallel heads or parallel actuators. The best results come when they replicate the control electronics to maintain the number of data paths through the controller. Otherwise the controller bottleneck reduces performance. Generally, for large or small transfer sizes, parallel heads with replication gave better performance.

nicastro:fft:
L. Nicastro and N. D'Amico. An optimized mass storage FFT for vector computers. Parallel Computing, 21:423-432, March 1995.

Keywords: out-of-core algorithm, parallel I/O algorithm, scientific computing, vector computer, pario-bib

Comment: They describe an out-of-core FFT algorithm for vector computers (one disk, one vector processor). They implemented it on a Convex and show good performance. Basically, they segment the array, do FFTs on each segment, and do some transposing and other work to combine the segments. Each segment is basically a memory-load. Seems parallelizable too.
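
The structure of such an out-of-core FFT can be sketched with stubs standing in for the real I/O and in-core FFT; the two-pass decomposition below is the classic scheme, an assumption about this paper's exact organization.

    #include <stdio.h>

    #define NSEG 8   /* array split into memory-sized segments */

    /* Stubs for real disk I/O and a real in-core 1-D FFT. */
    static void read_segment(int s)  { printf("read segment %d\n", s); }
    static void fft_in_core(int s)   { printf("FFT segment %d\n", s); }
    static void write_segment(int s) { printf("write segment %d\n", s); }

    int main(void)
    {
        /* Pass 1: FFT each memory-sized segment independently. */
        for (int s = 0; s < NSEG; s++) {
            read_segment(s);
            fft_in_core(s);
            write_segment(s);
        }
        /* Pass 2: a transpose-like permutation on disk, twiddle-factor
           multiplies, and a second round of in-core FFTs combine the
           per-segment results. */
        printf("transpose on disk, then second FFT pass\n");
        return 0;
    }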

nickolls:dpio:
John R. Nickolls and Ernie Rael. Data parallel Unix input/output for a massively parallel processor. Technical Report MP/P-17.93, MasPar Computer Corporation, 1993.

Keywords: Unix, parallel I/O, data parallel, pario-bib

Comment: Cite nickolls:maspar-io.

nickolls:maspar-io:
John R. Nickolls. The MasPar scalable Unix I/O system. In Proceedings of the Eighth International Parallel Processing Symposium, pages 390-394, Cancun, Mexico, April 1994.

Abstract: Scalable parallel computers require I/O balanced with computational power to solve data-intensive problems. Distributed memory architectures call for I/O hardware and software beyond those of conventional scalar systems.

This paper introduces the MasPar I/O system, designed to provide balanced and scalable data-parallel Unix I/O. The architecture and implementation of the I/O hardware and software are described. Key elements include parallel access to conventional Unix file descriptors and a self-routing multistage network coupled with a buffer memory for flexible parallel I/O transfers. Performance measurements are presented for parallel Unix I/O with a scalable RAID disk array, a RAM disk, and a HIPPI interconnect.

Keywords: parallel I/O, multiprocessor file system, SIMD, pario-bib

Comment: This provides the definitive reference for the Maspar parallel-I/O architecture and file system. This paper includes a brief discussion of the interface and performance results. Also includes some HIPPI interface performance results. This paper is the conference version of nickolls:dpio, so cite this one.

nieplocha:arrays:
Jarek Nieplocha and Ian Foster. Disk resident arrays: An array-oriented I/O library for out-of-core computations. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 196-204. IEEE Computer Society Press, October 1996.
See also later version foster:arrays.

Abstract: In out-of-core computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However the tools used to manage this storage are typically quite different from those used to manage access to local and remote memory. This disparity complicates implementation of out-of-core algorithms and hinders portability. We describe a programming model that addresses this problem. This model allows parallel programs to use essentially the same mechanisms to manage the movement of data between any two adjacent levels in a hierarchical memory system. We take as our starting point the Global Arrays shared-memory model and library, which support a variety of operations on distributed arrays, including transfer between local and remote memories. We show how this model can be extended to support explicit transfer between global memory and secondary storage, and we define a Disk Resident Arrays Library that supports such transfers. We illustrate the utility of the resulting model with two applications, an out-of-core matrix multiplication and a large computational chemistry program. We also describe implementation techniques on several parallel computers and present experimental results that demonstrate that the Disk Resident Arrays model can be implemented very efficiently on parallel computers.

Keywords: parallel I/O, pario-bib

nieplocha:chemio:
Jarek Nieplocha, Ian Foster, and Rick Kendall. ChemIO: High-performance parallel I/O for computational chemistry applications. The International Journal of High Performance Computing Applications, 12(3):345-363, Fall 1998.
See also earlier version foster:chemio.

Abstract: Recent developments in I/O systems on scalable parallel computers have sparked renewed interest in out-of-core methods for computational chemistry. These methods can improve execution time significantly relative to "direct" methods, which perform many redundant computations. However, the widespread use of such out-of-core methods requires efficient and portable implementations of often complex I/O patterns. The ChemIO project has addressed this problem by defining an I/O interface that captures the I/O patterns found in important computational chemistry applications and by providing high-performance implementations of this interface on multiple platforms. This development not only broadens the user community for parallel I/O techniques but also provides new insights into the functionality required in general-purpose scalable I/O libraries and the techniques required to achieve high performance I/O on scalable parallel computers.

Keywords: parallel I/O application, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

nieplocha:distant:
Jarek Nieplocha, Ian Foster, and Holger Dachsel. Distant I/O: One-sided access to secondary storage on remote processors. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, pages 148-154. IEEE Computer Society Press, July 1998.

Abstract: We propose a new parallel, noncollective I/O strategy called Distant I/O that targets clustered computer systems in which disks are attached to compute nodes. Distant I/O allows one-sided access to remote secondary storage without installing server processes or daemons on remote compute nodes. We implemented this model using Active Messages and demonstrated its performance advantages over the PIOFS parallel filesystem for an I/O-intensive parallel application on the IBM SP.

Keywords: parallel I/O, pario-bib, remote I/O

nieuwejaar:galley:
Nils Nieuwejaar and David Kotz. The Galley parallel file system. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 374-381, Philadelphia, May 1996. ACM Press.
See also later version nieuwejaar:jgalley-tr.

Abstract: As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.

Keywords: parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk

Comment: See also nieuwejaar:galley-perf. Also available at http://www.acm.org/pubs/citations/proceedings/supercomputing/237578/p374-nieuwejaar/

nieuwejaar:galley-perf:
Nils Nieuwejaar and David Kotz. Performance of the Galley parallel file system. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 83-94, Philadelphia, May 1996. ACM Press.
See also later version nieuwejaar:jgalley-tr.

Abstract: As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. Initial experiments, reported in this paper, indicate that Galley is capable of providing high-performance I/O to applications that access data in patterns that have been observed to be common.

Keywords: parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk

Comment: See also nieuwejaar:galley.

nieuwejaar:jgalley:
Nils Nieuwejaar and David Kotz. The Galley parallel file system. Parallel Computing, 23(4):447-476, June 1997.
See also earlier version nieuwejaar:jgalley-tr.

Abstract: Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

Keywords: parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk

Comment: A revised version of nieuwejaar:jgalley-tr, which is a combination of nieuwejaar:galley and nieuwejaar:galley-perf.

nieuwejaar:jgalley-tr:
Nils Nieuwejaar and David Kotz. The Galley parallel file system. Technical Report PCS-TR96-286, Dept. of Computer Science, Dartmouth College, May 1996.
See also earlier version nieuwejaar:galley.
See also later version nieuwejaar:jgalley.

Abstract: Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

Keywords: parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk

nieuwejaar:strided:
Nils Nieuwejaar and David Kotz. A multiprocessor extension to the conventional file system interface. Technical Report PCS-TR94-230, Dept. of Computer Science, Dartmouth College, September 1994.
See also later version nieuwejaar:strided2-tr.

Abstract: As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present an extension which supports strided and nested-strided I/O requests.

Keywords: parallel I/O, multiprocessor file system, pario-bib, dfk

nieuwejaar:strided2:
Nils Nieuwejaar and David Kotz. Low-level interfaces for high-level parallel I/O. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 47-62, April 1995.
Identical to nieuwejaar:strided2-tr.
See also later version nieuwejaar:strided2-book.

Abstract: As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present three extensions to the interface that support strided, nested-strided, and nested-batched I/O requests. We show how these extensions can be used to express common access patterns.

Keywords: parallel I/O, multiprocessor file system, pario-bib, dfk

Comment: Identical to revised TR95-253, nieuwejaar:strided2-tr. Cite nieuwejaar:strided2-book.
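
To see what a strided request buys, consider this hypothetical C rendering (the struct and its field names are invented; the paper defines its own extensions): one request describes a whole regular pattern, e.g., a column of a row-major matrix, that would otherwise take N tiny reads.

    #include <stdio.h>
    #include <stddef.h>

    /* A single-level strided request: count pieces of reclen bytes,
       each stride bytes apart, starting at offset. */
    typedef struct {
        size_t offset;
        size_t reclen;
        size_t stride;
        size_t count;
    } strided_t;

    int main(void)
    {
        size_t N = 1024, col = 7;  /* column of an N x N double matrix */
        strided_t req = {
            .offset = col * sizeof(double),
            .reclen = sizeof(double),
            .stride = N * sizeof(double),
            .count  = N,
        };
        printf("one strided request replaces %zu tiny reads "
               "(piece = %zu B, stride = %zu B)\n",
               req.count, req.reclen, req.stride);
        return 0;
    }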

nieuwejaar:strided2-book:
Nils Nieuwejaar and David Kotz. Low-level interfaces for high-level parallel I/O. In Ravi Jain, John Werth, and James C. Browne, editors, Input/Output in Parallel and Distributed Computer Systems, volume 362 of The Kluwer International Series in Engineering and Computer Science, chapter 9, pages 205-223. Kluwer Academic Publishers, 1996.
See also earlier version nieuwejaar:strided2.

Abstract: As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present three extensions to the interface that support strided, nested-strided, and nested-batched I/O requests. We show how these extensions can be used to express common access patterns.

Keywords: parallel I/O, multiprocessor file system, pario-bib, dfk

Comment: Part of a whole book on parallel I/O; see iopads-book and nieuwejaar:strided2 (which is not much different).

nieuwejaar:strided2-tr:
Nils Nieuwejaar and David Kotz. Low-level interfaces for high-level parallel I/O. Technical Report PCS-TR95-253, Dept. of Computer Science, Dartmouth College, March 1995. Revised 4/18/95 and appeared in IOPADS workshop at IPPS '95.
Identical to nieuwejaar:strided2.
See also earlier version nieuwejaar:strided.

Abstract: As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present three extensions to the interface that support strided, nested-strided, and nested-batched I/O requests. We show how these extensions can be used to express common access patterns.

Keywords: parallel I/O, multiprocessor file system, pario-bib, dfk

Comment: After revision, identical to nieuwejaar:strided2.

nieuwejaar:thesis:
Nils A. Nieuwejaar. Galley: A New Parallel File System for Parallel Applications. PhD thesis, Dept. of Computer Science, Dartmouth College, November 1996. Available as Dartmouth Technical Report PCS-TR96-300.

Abstract: Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Most multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access those multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated application and library programmers to use knowledge about their I/O to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support.

In this work we examine current multiprocessor file systems, as well as how those file systems are used by scientific applications. Contrary to the expectations of the designers of current parallel file systems, the workloads on those systems are dominated by requests to read and write small pieces of data. Furthermore, rather than being accessed sequentially and contiguously, as in uniprocessor and supercomputer workloads, files in multiprocessor file systems are accessed in regular, structured, but non-contiguous patterns.

Based on our observations of multiprocessor workloads, we have designed Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. In this work, we introduce Galley and discuss its design and implementation. We describe Galley's new three-dimensional file structure and discuss how that structure can be used by parallel applications to achieve higher performance. We introduce several new data-access interfaces, which allow applications to explicitly describe the regular access patterns we found to be common in parallel file system workloads. We show how these new interfaces allow parallel applications to achieve tremendous increases in I/O performance. Finally, we discuss how Galley's new file structure and data-access interfaces can be useful in practice.

Keywords: parallel I/O, multiprocessor file system, file system workload characterization, file access patterns, file system interface, pario-bib

nieuwejaar:workload:
Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael Best. File-access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, October 1996.
See also earlier version nieuwejaar:workload-tr.

Abstract: Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems.

The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide multiprocessor file-system design.

Keywords: parallel I/O, file system workload, workload characterization, file access pattern, multiprocessor file system, dfk, pario-bib

Comment: See also kotz:workload, nieuwejaar:strided, ap:workload.

nieuwejaar:workload-tr:
Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael Best. File-access characteristics of parallel scientific workloads. Technical Report PCS-TR95-263, Dept. of Computer Science, Dartmouth College, August 1995.
See also earlier version kotz:workload.
See also later version nieuwejaar:workload.

Abstract: Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of parallel file systems.

The design of a high-performance parallel file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of parallel file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide parallel file-system design.

Keywords: parallel I/O, file system workload, workload characterization, file access pattern, multiprocessor file system, dfk, pario-bib

Comment: See also nieuwejaar:strided, ap:workload.

ninghui:pfs:
Sun Ninghui. The design of parallel file systems. Chinese Journal of Computers, 17(12):938-945, December 1994. In Chinese.

Keywords: parallel file systems, parallel I/O, pario-bib

Comment: From the abstract, it doesn't appear to offer anything new, but it's hard to tell.

nishino:sfs:
H. Nishino, S. Naka, and K. Ikumi. High performance file system for supercomputing environment. In Proceedings of Supercomputing '89, pages 747-756, 1989.

Keywords: supercomputer, file system, parallel I/O, pario-bib

Comment: A modification to the Unix file system to allow for supercomputer access. Workload: file size from few KB to few GB, I/O operation size from few bytes to hundreds of MB. Generally programs split into I/O-bound and CPU-bound parts. Sequential and random access. Needs: giant files (bigger than device), peak hardware performance for large files, NFS access. Their FS is built into Unix ``transparently''. Space allocated in clusters, rather than blocks; clusters might be as big as a cylinder. Allows for efficient, large files. Mentions parallel disks as part of a ``virtual volume'' but does not elaborate. Prefetching within a cluster.

nitzberg:bcollective:
Bill Nitzberg and Virginia Lo. Collective buffering: Improving parallel I/O performance. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 19, pages 271-281. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version nitzberg:collective.

Keywords: parallel I/O, collective I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of nitzberg:collective.

nitzberg:cfs:
Bill Nitzberg. Performance of the iPSC/860 Concurrent File System. Technical Report RND-92-020, NAS Systems Division, NASA Ames, December 1992.
See also later version krystynak:pario.

Abstract: Typical scientific applications require vast amounts of processing power coupled with significant I/O capacity. Highly parallel computer systems can provide processing power at low cost, but tend to lack I/O capacity. By evaluating the performance and scalability of the Intel iPSC/860 Concurrent File System (CFS), we can get an idea of the current state of parallel I/O performance. I ran three types of tests on the iPSC/860 system at the Numerical Aerodynamic Simulation facility (NAS): broadcast, simulating initial data loading; partitioned, simulating reading and writing a one-dimensional decomposition; and interleaved, simulating reading and writing a two-dimensional decomposition.

The CFS at NAS can sustain up to 7 megabytes per second writing and 8 megabytes per second reading. However, due to the limited disk cache size, partitioned read performance sharply drops to less than 1 megabyte per second on 128 nodes. In addition, interleaved read and write performance show a similar drop in performance for small block sizes. Although the CFS can sustain 70-80% of peak I/O throughput, the I/O performance does not scale with the number of nodes.

Obtaining maximum performance may require significant programming effort: pre-allocating files, overlapping computation and I/O, using large block sizes, and limiting I/O parallelism. A better approach would be to attack the problem by either fixing the CFS (e.g., add more cache to the I/O nodes), or hiding its idiosyncrasies (e.g., implement a parallel I/O library).

Keywords: Intel, parallel file system, performance measurement, parallel I/O, pario-bib

Comment: Straightforward measurements of an iPSC/860 with 128 compute nodes, 10 I/O nodes, and 10 disks. This is a bigger system than has been measured before. Has some basic MB/s measurements for some features in Tables 1-2. CFS bug prevents more than 2 asynch requests at a time. Another bug forced random-writes to use preallocated files. For low numbers of procs, they weren't able to pull the full disk bandwidth. Cache thrashing caused problems when they had a large number of procs, because each read prefetched 8 blocks, which were flushed by some other proc doing a subsequent read. They worked around it by synchronizing procs to limit concurrency. Increasing cache size is the right answer, but is not scalable.
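
The report's advice to overlap computation and I/O can be illustrated with a small double-buffering sketch (my illustration, not code from the report), assuming an ordinary local file and Python 3.8 or later:

    import threading, queue

    def overlapped_read(path, process, block_size=1 << 22):
        blocks = queue.Queue(maxsize=2)             # double buffering

        def reader():
            with open(path, "rb") as f:
                while chunk := f.read(block_size):  # large requests, as recommended
                    blocks.put(chunk)
            blocks.put(None)                        # end-of-file sentinel

        threading.Thread(target=reader, daemon=True).start()
        while (chunk := blocks.get()) is not None:
            process(chunk)                          # compute while the next read is in flight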

nitzberg:collective:
Bill Nitzberg and Virginia Lo. Collective buffering: Improving parallel I/O performance. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing, pages 148-157, Portland, OR, August 1997. IEEE Computer Society Press.
See also later version nitzberg:bcollective.

Abstract: "Parallel I/O" is the support of a single parallel application run on many nodes; application data is distributed among the nodes, and is read or written to a single logical file, itself spread across nodes and disks. Parallel I/O is a mapping problem from the data layout in node memory to the file layout on disks. Since the mapping can be quite complicated and involve significant data movement, optimizing the mapping is critical for performance.

We discuss our general model of the problem, describe four Collective Buffering algorithms we designed, and report experiments testing their performance on an Intel Paragon and an IBM SP2, both housed at NASA Ames Research Center. Our experiments show improvements of up to two orders of magnitude over standard techniques and the potential to deliver peak performance with minimal hardware support.

Keywords: parallel I/O, collective I/O, pario-bib
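
The core of collective buffering is a two-phase transfer: first permute the data among the processes so each one holds a large contiguous region of the file, then issue one big read or write per process. A toy single-address-space sketch of the write case (my illustration, not the paper's code), assuming a cyclic distribution whose element count divides evenly among the processes; the in-memory permutation stands in for the all-to-all exchange a real implementation performs:

    def collective_write(slices):            # slices[p] holds elements p, p+P, p+2P, ...
        P = len(slices)
        n = sum(len(s) for s in slices)      # assumes n is divisible by P
        buf = [None] * n
        for p, s in enumerate(slices):       # phase 1: permute into file order
            for i, x in enumerate(s):
                buf[i * P + p] = x
        block = n // P
        return [buf[p * block:(p + 1) * block] for p in range(P)]  # phase 2: one block each

    # Example: 4 processes, cyclic distribution of 0..15; afterwards each
    # process writes 4 contiguous elements in a single large request.
    slices = [list(range(p, 16, 4)) for p in range(4)]
    print(collective_write(slices))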

nitzberg:sc94tutorial:
Bill Nitzberg and Samuel A. Fineberg. Parallel I/O on highly parallel systems: Supercomputing '94 tutorial M11 notes. Technical Report NAS-94-005, NASA Ames Research Center, November 1994.
See also later version nitzberg:sc95tutorial.

Abstract: Typical scientific applications require vast amounts of processing power coupled with significant I/O capacity. Highly parallel computer systems provide floating point processing power at low cost, but efficiently supporting a scientific workload also requires commensurate I/O performance. In order to achieve high I/O performance, these systems utilize parallelism in their I/O subsystems-supporting concurrent access to files by multiple nodes of a parallel application, and striping files across multiple disks. However, obtaining maximum I/O performance can require significant programming effort.

This tutorial presents a snapshot of the state of I/O on highly parallel systems by comparing the well-balanced I/O performance of a traditional vector supercomputer (the Cray Y/MP C90) with the I/O performance of various highly parallel systems (Cray T3D, IBM SP-2, Intel iPSC/860 and Paragon, and Thinking Machines CM-5). In addition, the tutorial covers benchmarking techniques for evaluating current parallel I/O systems and techniques for improving parallel I/O performance. Finally, the tutorial presents several high level parallel I/O libraries and shows how they can help application programmers improve I/O performance.

Keywords: parallel I/O, tutorial, pario-bib

nitzberg:sc95tutorial:
Bill Nitzberg and Samuel A. Fineberg. Parallel I/O on highly parallel systems: Supercomputing '95 tutorial M6 notes. Technical Report NAS-95-022, NASA Ames Research Center, December 1995.
See also earlier version nitzberg:sc94tutorial.

Abstract: Typical scientific applications require vast amounts of processing power coupled with significant I/O capacity. Highly parallel computer systems provide floating-point processing power at low cost, but efficiently supporting a scientific workload also requires commensurate I/O performance. To achieve high I/O performance, these systems use parallelism in their I/O subsystems, supporting concurrent access to files by multiple nodes of a parallel application and striping files across multiple disks. However, obtaining maximum I/O performance can require significant programming effort. This tutorial will present a comprehensive survey of the state of the art in parallel I/O from basic concepts to recent advances in the research community. Requirements, interfaces, architectures, and performance will be illustrated using concrete examples from commercial offerings (Cray T3D, IBM SP-2, Intel Paragon, Meiko CS-2, and workstation clusters) and academic research projects (MPI-IO, Panda, PASSION, PIOUS, and Vesta). The material covered is roughly 30% beginner, 60% intermediate, and 10% advanced.

Keywords: parallel I/O, tutorial, pario-bib

nitzberg:thesis:
William J. Nitzberg. Collective Parallel I/O. PhD thesis, Department of Computer and Information Science, University of Oregon, December 1995.

Abstract: Parallel I/O, the process of transferring a global data structure distributed among compute nodes to a file striped across storage devices, can be quite complicated and involve a significant amount of data movement. Optimizing parallel I/O with respect to data distribution, file layout, and machine architecture is critical for performance. In this work, we propose a solution to both the performance and portability problems plaguing the wide acceptance of distributed memory parallel computers for scientific computing: a collective parallel I/O interface and efficient algorithms to implement it. A collective interface allows the programmer to specify a file access as a high-level global operation rather than as a series of seeks and writes. This not only provides a more natural interface for the programmer, but also provides the system with both the opportunity and the semantic information necessary to optimize the file operation.

We attack this problem in three steps: we evaluate an early parallel I/O system, the Intel iPSC/860 Concurrent File System; we design and analyze the performance of two classes of algorithms taking advantage of collective parallel I/O; and we design MPI-IO, a collective parallel I/O interface likely to become the standard for portable parallel I/O.

The collective I/O algorithms fall into two broad categories: data block scheduling and collective buffering. Data block scheduling algorithms attempt to schedule the individual data transfers to minimize resource contention and to optimize for particular hardware characteristics. We develop and evaluate three data block scheduling algorithms: Grouping, Random, and Sliding Window. The data block scheduling algorithms improved performance by as much as a factor of eight. The collective buffering algorithms permute the data before writing or after reading in order to combine small file accesses into large blocks. We design and test a series of four collective buffering algorithms and demonstrate improvement in performance by two orders of magnitude over naive file I/O for the hardest, three-dimensional distributions.

Keywords: parallel I/O, parallel algorithm, file system interface, pario-bib

Comment: See also nitzberg:cfs and corbett:mpi-overview.

no:file-db:
Jaechun No, Rajeev Thakur, and Alok Choudhary. Integrating parallel file I/O and database support for high-performance scientific data management. In Proceedings of SC2000: High Performance Networking and Computing, Dallas, TX, November 2000. IEEE Computer Society Press. To appear.

Abstract: Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions are used for this problem: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that aims to combine the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. SDM takes advantage of various I/O optimizations available in MPI-IO, such as collective I/O and noncontiguous requests, in a manner that is transparent to the user. As a result, users can write and retrieve data with the performance of parallel file I/O, without having to bother with the details of actually performing file I/O.

In this paper, we describe the design and implementation of SDM. With the help of two parallel application templates, ASTRO3D and an Euler solver, we illustrate how some of the design criteria affect performance.

Keywords: scientific computing, database, parallel I/O, pario-bib

no:irregular-io:
Jaechun No, Sung-Soon Park, Jesus Carretero, Alok Choudhary, and Pang Chen. Design and implementation of a parallel I/O runtime system for irregular applications. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, pages 280-284. IEEE Computer Society Press, March 1998.
See also later version no:jirregular.

Keywords: parallel I/O, pario-bib

Comment: see no:irregular2 and no:irregular.

no:irregular2:
Jaechun No, J. Carretero, and Alok Choudhary. High performance parallel I/O schemes for irregular applications on clusters of workstations. In Proceedings of the Seventh High-Performance Computing and Networking Conference, pages 1117-1126, 1999.
See also earlier version no:irregular-io.

Abstract: Due to the convergence of fast microprocessors with low-latency, high-bandwidth communication networks, clusters of workstations are being used for high-performance computing. In this paper we present the design and implementation of a runtime system to support irregular applications on clusters of workstations, called "Collective I/O Clustering". The system provides a friendly programming model for performing I/O in irregular applications on clusters of workstations, and is completely integrated with the underlying communication and I/O system. All the performance results were obtained on the IBM-SP machine, located at Argonne National Labs.

Keywords: parallel I/O, irregular applications, pario-bib

no:jirregular:
Jaechun No, Jesus Carretero, Sung-Soon Park, Alok Choudhary, and Pang Chen. Design and implementation of a parallel I/O runtime system for irregular applications. Journal of Parallel and Distributed Computing, 62(2):193-220, February 2002.
See also earlier version no:irregular-io.

Keywords: parallel I/O, pario-bib

nodine:deterministic:
M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures, pages 120-129, Velen, Germany, 1993.

Abstract: We present an elegant deterministic load balancing strategy for distribution sort that is applicable to a wide variety of parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm for external sorting with multiple disks and parallel processors. In each input/output (I/O) operation, each of the $D \geq 1$ disks can simultaneously transfer a block of $B$ contiguous records. Our two measures of performance are the number of I/Os and the amount of work done by the CPU(s); our algorithm is simultaneously optimal for both measures. We also show how to sort deterministically in parallel memory hierarchies. When the processors are interconnected by any sort of a PRAM, our algorithms are optimal for all parallel memory hierarchies; when the interconnection network is a hypercube, our algorithms are either optimal or best-known.

Keywords: parallel I/O algorithm, sorting, shared memory, pario-bib

Comment: Short version of nodine:sort2 and nodine:sortdisk.
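
For reference, "optimal" in this line of work means matching the standard parallel-disk sorting bound: with $N$ records, internal memory $M$, block size $B$, and $D$ disks, sorting takes $\Theta((N/(DB)) \log_{M/B} (N/B))$ parallel I/Os, and the contribution here is achieving that bound deterministically rather than with randomization.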

nodine:greed:
Mark H. Nodine and Jeffrey Scott Vitter. Greed sort: An optimal external sorting algorithm for multiple disks. Technical Report CS-91-20, Brown University, 1992. A summary appears in SPAA '91.

Abstract: We present an optimal deterministic algorithm for external sorting on multiple disks. Our measure of performance is the number of input/output (I/O) operations. In each I/O, each disk can simultaneously transfer a block of data. Our algorithm improves upon a recent randomized optimal algorithm and the (non-optimal) commonly used technique of disk striping. The code is simple enough for easy implementation.

Keywords: parallel I/O algorithms, sorting, pario-bib

Comment: Summary is nodine:sort. This is revision of CS-91-04.

nodine:loadbalance:
Mark H. Nodine and Jeffrey Vitter. Load balancing paradigms for optimal use of parallel disks and parallel memory hierarchies. In Proceedings of the 1993 DAGS/PC Symposium, pages 26-39, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.

Abstract: We present several load balancing paradigms pertinent to optimizing I/O performance with disk and processor parallelism. We use sorting as our canonical application to illustrate the paradigms, and we survey a wide variety of applications in computational geometry. The use of parallel disks can help overcome the I/O bottleneck in sorting if the records in each read or write are evenly balanced among the disks. There are three known load balancing paradigms that lead to optimal I/O algorithms: using randomness to assign blocks to disks, using the disks predominantly independently, and deterministically balancing the blocks by matching. In this report, we describe all of these techniques in detail and compare their relative advantages. We show how randomized and deterministic balancing can be extended to provide sorting algorithms that are optimal both in terms of the number of I/Os and the internal processing time for parallel-processing machines with scalable I/O subsystems and with parallel memory hierarchies. We also survey results achieving optimal performance in these models for a large range of online and batch problems in computational geometry.

Keywords: parallel I/O algorithm, memory hierarchy, load balance, sorting, pario-bib

Comment: Invited speaker: Jeffrey Vitter.

nodine:opt-sort:
Mark H. Nodine and Jeffrey Scott Vitter. Paradigms for optimal sorting with multiple disks. In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume I, pages 50-59, 1993.

Keywords: parallel I/O algorithms, sorting, pario-bib

Comment: They compare three techniques for balancing I/O across parallel disks, using sorting as an example. The three are randomization, using the disks independently, and tricky matching techniques (as in balance sort). They also look at parallel memory hierarchies. All in all, it seems to be mostly a survey of techniques in earlier papers.

nodine:sort:
Mark H. Nodine and Jeffrey Scott Vitter. Large-scale sorting in parallel memories. In Proceedings of the Third Symposium on Parallel Algorithms and Architectures, pages 29-39, 1991.

Keywords: external sorting, file access pattern, parallel I/O, pario-bib

Comment: Describes algorithms for external sorting that are optimal in the number of I/Os. Proposes a couple of fairly-realistic memory hierarchy models. See also journal version vitter:uniform.

nodine:sort2:
Mark H. Nodine and Jeffrey Scott Vitter. Optimal deterministic sorting in parallel memory hierarchies. Technical Report CS-92-38, Brown University, August 1992.

Keywords: parallel I/O algorithms, parallel memory, sorting, pario-bib

Comment: see nodine:deterministic.

nodine:sortdisk:
Mark H. Nodine and Jeffrey Scott Vitter. Optimal deterministic sorting on parallel disks. Technical Report CS-92-08, Brown University, August 1992.

Keywords: parallel I/O algorithms, sorting, pario-bib

Comment: see nodine:deterministic.

nurmi:atm:
Marc A. Nurmi, William E. Bejcek, Rod N. Gregoire, K. C. Liu, and Mark D. Pohl. Automatic management of CPU and I/O bottlenecks in distributed applications on ATM networks. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 481-489. IEEE Computer Society Press, August 1996.

Abstract: Existing parallel programming environments for networks of workstations improve the performance of computationally intensive applications by using message passing or virtual shared memory to alleviate CPU bottlenecks. This paper describes an approach based on message passing that addresses both CPU and I/O bottlenecks for a specific class of distributed applications on ATM networks. ATM provides the bandwidth required to utilize multiple I/O channels in parallel. This paper also describes an environment based on distributed process management and centralized application management that implements the approach. The environment adds processes to a running application when necessary to alleviate CPU and I/O bottlenecks while managing process connections in a manner that is transparent to the application.

Keywords: parallel I/O, ATM, parallel networking, pario-bib

ober:seismic:
Curtis Ober, Ron Oldfield, John VanDyke, and David Womble. Seismic imaging on massively parallel computers. Technical Report SAND96-1112, Sandia National Laboratories, April 1996.

Abstract: Fast, accurate imaging of complex, oil-bearing geologies, such as overthrusts and salt domes, is the key to reducing the costs of domestic oil and gas exploration. Geophysicists say that the known oil reserves in the Gulf of Mexico could be significantly increased if accurate seismic imaging beneath salt domes was possible. A range of techniques exist for imaging these regions, but the highly accurate techniques involve the solution of the wave equation and are characterized by large data sets and large computational demands. Massively parallel computers can provide the computational power for these highly accurate imaging techniques.

A brief introduction to seismic processing will be presented, and the implementation of a seismic-imaging code for distributed memory computers will be discussed. The portable code, Salvo, performs a wave-equation-based, 3-D, prestack, depth imaging and currently runs on the Intel Paragon, the Cray T3D and SGI Challenge series. It uses MPI for portability, and has sustained 22 Mflops/sec/proc (compiled FORTRAN) on the Intel Paragon.

Keywords: multiprocessor application, scientific computing, seismic data processing, parallel I/O, pario-bib

Comment: 2 pages about their I/O scheme, mostly regarding a calculation of the optimal balance between compute nodes and I/O nodes to achieve perfect overlap.
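
A back-of-envelope version of that balance argument (my gloss, not the report's formula): if each compute phase lasts $T_c$ seconds and moves $V$ bytes, and each I/O node sustains $b$ bytes/sec, then the I/O is fully hidden behind computation when $n_{io} \cdot b \cdot T_c \geq V$, i.e. $n_{io} \geq V / (b T_c)$; fewer I/O nodes stall the compute nodes, and more sit idle.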

ober:seismic2:
Curtis Ober, Ron Oldfield, David Womble, John VanDyke, and Sudip Dosanjh. Seismic imaging on massively parallel computers. In Proceedings of the 1996 Simulations Multiconference, April 1996.

Keywords: parallel application, scientific computing, seismic data processing, parallel I/O, pario-bib, oldfield

ober:seismic3:
Curtis Ober, Ron Oldfield, David Womble, L. Romero, and Charles Burch. Practical aspects of prestack depth migration with finite differences. In Proceedings of the 67th Annual International Meeting of the Society of Exploration Geophysicists, pages 1758-1761, Dallas, Texas, November 1997. Expanded Abstracts.

Keywords: parallel application, scientific computing, seismic data processing, parallel I/O, pario-bib, oldfield

oed:t3d:
Wilfried Oed. The Cray Research massively parallel processor system CRAY T3D. Technical report, Cray Research GmbH, München, Germany, November 15, 1993.

Keywords: parallel architecture, shared memory, supercomputer, parallel I/O, pario-bib

Comment: A MIMD, shared-memory machine, with 2-processor units embedded in a 3-d torus. Each link is bidirectional and runs 300 MB/s. Processors are 150 MHz ALPHA, plus 16-64 MB RAM, plus a memory interface unit. Global physical address space with remote-reference and block-transfer capability. Not clear about cache coherency. Separate tree network for global synchronization. Support for message send and optional interrupt. I/O is all done through interface nodes that hook to the YMP host and to its I/O clusters with 400 MB/s links. I/O is by default serialized, but they do support a ``broadcast'' read operation (but see pase:t3d-fortran). FORTRAN compiler supports the NUMA shared memory; PVM is used for C and message passing.

ogata:diskarray:
Mikito Ogata and Michael J. Flynn. A queueing analysis for disk array systems. Technical Report CSL-TR-90-443, Stanford University, 1990.

Keywords: disk array, performance analysis, pario-bib

Comment: Fairly complex analysis of a multiprocessor attached to a disk array system through a central server that is the buffer. Assumes task-oriented model for parallel system, where tasks can be assigned to any CPU; this makes for an easy model. Like Reddy, they compare declustering and striping (they call them striped and synchronized disks).

okeefe:fibre:
Matthew T. O'Keefe. Shared file systems and Fibre Channel. In Proceedings of the Sixth NASA Goddard Conference on Mass Storage Systems and Technologies, pages 1-16, College Park, MD, March 1998. IEEE Computer Society Press.

Keywords: distributed file system, data storage, mass storage, network-attached disks, Fibre Channel, pario-bib

Comment: position paper

oldfield:app-pario:
Ron Oldfield and David Kotz. Applications of parallel I/O. Technical Report PCS-TR98-337, Dept. of Computer Science, Dartmouth College, August 1998. Supplement to PCS-TR96-297.
See also earlier version kotz:app-pario.
See also later version oldfield:bapp-pario.

Abstract: Scientific applications are increasingly being implemented on massively parallel supercomputers. Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O. It will be updated periodically.

Keywords: parallel I/O application, file access patterns, pario-bib, dfk

oldfield:armada:
Ron Oldfield and David Kotz. Armada: A parallel file system for computational grids. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 194-201, Brisbane, Australia, May 2001. IEEE Computer Society Press.

Abstract: High-performance distributed computing appears to be shifting away from tightly-connected supercomputers to computational grids composed of heterogeneous systems of networks, computers, storage devices, and various other devices that collectively act as a single geographically distributed virtual computer. One of the great challenges for this environment is providing efficient parallel data access to remote distributed datasets. In this paper, we discuss some of the issues associated with parallel I/O and computational grids and describe the design of a flexible parallel file system that allows the application to control the behavior and functionality of virtually all aspects of the file system.

Keywords: parallel I/O, Grid, parallel file system, pario-bib

Comment: Named one of two "best" papers in the Grid category.

oldfield:bapp-pario:
Ron Oldfield and David Kotz. Scientific applications using parallel I/O. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 45, pages 655-666. IEEE Computer Society Press and John Wiley & Sons, 2001.
See also earlier version oldfield:app-pario.

Abstract: Scientific applications are increasingly being implemented on massively parallel supercomputers. Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O.

Keywords: parallel I/O application, file access patterns, pario-bib, dfk

Comment: Part of jin:io-book.

oldfield:emulab-tr:
Ron Oldfield and David Kotz. Using the Emulab network testbed to evaluate the Armada I/O framework for computational grids. Technical Report TR2002-433, Dept. of Computer Science, Dartmouth College, Hanover, NH, September 2002.

Abstract: This short report describes our experiences using the Emulab network testbed at the University of Utah to test performance of the Armada framework for parallel I/O on computational grids.

Keywords: emulab, network emulation, Armada, performance, dfk, pario-bib

oldfield:restruct:
Ron Oldfield and David Kotz. Improving data access for computational grid applications. Cluster Computing, The Journal of Networks, Software Tools and Applications, 2004. Accepted for publication.

Abstract: High-performance computing increasingly occurs on ``computational grids'' composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single ``virtual'' computer. A key challenge in this environment is to provide efficient access to data distributed across remote data servers. Our parallel I/O framework, called Armada, allows application and data-set providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before computation. Although the framework provides a simple programming model for the application programmer and the data-set provider, the resulting graph may contain bottlenecks that prevent efficient data access. In this paper, we present an algorithm used to restructure Armada graphs that distributes computation and data flow to improve performance in the context of a wide-area computational grid.

Keywords: parallel I/O, Grid computing, distributed computing, graph algorithms, pario-bib

oldfield:seismic:
Ron A. Oldfield, David E. Womble, and Curtis C. Ober. Efficient parallel I/O in seismic imaging. The International Journal of High Performance Computing Applications, 12(3):333-344, Fall 1998.

Abstract: While high performance computers tend to be measured by their processor and communications speeds, the bottleneck for many large-scale applications is the I/O performance rather than the computational or communication performance. One such application is the processing of 3D seismic data. Seismic data sets, consisting of recorded pressure waves, can be very large, sometimes more than a terabyte in size. Even if the computations can be performed in-core, the time required to read the initial seismic data and velocity model and write images is substantial. This paper will discuss our approach in handling the massive I/O requirements of seismic processing and show the performance of our imaging code (Salvo) on the Intel Paragon.

Keywords: parallel I/O application, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

oldfield:thesis:
Ron Oldfield. Efficient I/O for Computational Grid Applications. PhD thesis, Dept. of Computer Science, Dartmouth College, May 2003. Available as Dartmouth Computer Science Technical Report TR2003-459.
See also later version oldfield:thesis-tr.

Abstract: High-performance computing increasingly occurs on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual" computer. A key challenge in this environment is to provide efficient access to data distributed across remote data servers. This dissertation explores some of the issues associated with I/O for wide-area distributed computing and describes an I/O system, called Armada, with the following features: a framework to allow application and dataset providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before or after computation; an algorithm to restructure application graphs to increase parallelism and to improve network performance in a wide-area network; and a hierarchical graph-partitioning scheme that deploys components of the application graph in a way that is both beneficial to the application and sensitive to the administrative policies of the different administrative domains. Experiments show that applications using Armada perform well in both low- and high-bandwidth environments, and that our approach does an exceptional job of hiding the network latency inherent in grid computing.

Keywords: parallel I/O, Grid computing, pario-bib

oldfield:thesis-tr:
Ron Oldfield. Efficient I/O for computational grid applications. Technical Report TR2003-459, Dept. of Computer Science, Dartmouth College, May 2003.
See also earlier version oldfield:thesis.

Abstract: High-performance computing increasingly occurs on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual" computer. A key challenge in this environment is to provide efficient access to data distributed across remote data servers. This dissertation explores some of the issues associated with I/O for wide-area distributed computing and describes an I/O system, called Armada, with the following features: a framework to allow application and dataset providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before or after computation; an algorithm to restructure application graphs to increase parallelism and to improve network performance in a wide-area network; and a hierarchical graph-partitioning scheme that deploys components of the application graph in a way that is both beneficial to the application and sensitive to the administrative policies of the different administrative domains. Experiments show that applications using Armada perform well in both low- and high-bandwidth environments, and that our approach does an exceptional job of hiding the network latency inherent in grid computing.

Keywords: parallel I/O, Grid computing, pario-bib

olson:random:
Thomas M. Olson. Disk array performance in a random I/O environment. Computer Architecture News, 17(5):71-77, September 1989.

Keywords: I/O benchmark, transaction processing, pario-bib

Comment: See wolman:iobench. Used IOBENCH to compare a normal disk configuration with striped disks, RAID level 1, and RAID level 5, under a random I/O workload. Multiple disks with files on different disks gave good performance (high throughput and low response time) with multiple users. Striping ensures a balanced load, with similar performance. RAID level 1 or level 5 ensures reliability at a performance cost over striping, but is still good. Especially sensitive to the write/read ratio: performance is lost for a large number of writes.

oyang:m2io:
Yen-Jen Oyang. Architecture, operating system, and I/O subsystem design of the $M^2$ database machine. In Proceedings of the Parallel Systems Fair at the International Parallel Processing Symposium, pages 31-38, 1993.

Keywords: parallel I/O, multiprocessor file system, parallel database, pario-bib

Comment: A custom multiprocessor with shared-memory clusters networked together and to shared disks. Runs Mach. Directory-based coherence protocol for the distributed file system. Background writeback.

pahuja:dpio:
Neena Pahuja and Gautam M. Shroff. A data parallel I/O library for workstation networks. In Proceedings of the 1995 International Conference on High Performance Computing, pages 423-428, New Delhi, India, December 1995.

Keywords: disk array, multimedia, parallel I/O, pario-bib

paleczny:support:
Michael Paleczny, Ken Kennedy, and Charles Koelbel. Compiler support for out-of-core arrays on data parallel machines. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 110-118, McLean, VA, February 1995.

Keywords: compilers, parallel I/O, out-of-core applications, pario-bib

Comment: They are developing extensions to the FortranD compiler so that it supports I/O-related directives for out-of-core computations. The compiler then analyzes the computation, inserts the necessary I/O calls, and optimizes the I/O. They hand-compile a red-black relaxation program and an LU-factorization program. I/O was much faster than VM, particularly because they were able to make large requests rather than faulting on individual pages. Overlapping I/O and computation was also a big win. See also kennedy:sio, bordawekar:model.
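
The transformation such a compiler performs can be sketched by hand. A minimal illustration (my own, not the FortranD compiler's output), assuming a 1-D array of float64 values stored in a binary file: process the array in large explicit tiles instead of letting demand paging fault it in one page at a time.

    import numpy as np

    def scale_out_of_core(path, factor, tile_elems=1 << 20):
        itemsize = np.dtype(np.float64).itemsize
        with open(path, "r+b") as f:
            f.seek(0, 2)
            n = f.tell() // itemsize              # total elements in the file
            for start in range(0, n, tile_elems):
                count = min(tile_elems, n - start)
                f.seek(start * itemsize)
                tile = np.frombuffer(f.read(count * itemsize), np.float64).copy()
                tile *= factor                    # the in-core computation on this tile
                f.seek(start * itemsize)
                f.write(tile.tobytes())           # one large write instead of many faults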

panasas:architecture:
Object-based storage architecture: Defining a new generation of storage systems built on distributed, intelligent storage devices. Panasas Inc. white paper, version 1.0, October 2003. http://www.panasas.com/docs/.

Keywords: object-based storage, distributed file system, parallel file system, pario-bib

Comment: The paper describes the architecture of a proprietary object-based storage system for clusters, an extension of Garth Gibson's NASD work at CMU (see gibson:nasd-tr). Similar to Lustre (cfs:lustre, braam:lustre-arch).

panfilov:raid5:
Oleg A. Panfilov. Performance analysis of RAID-5 disk arrays. In Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, volume I, pages 49-60, January 1995.

Keywords: RAID, disk array, parallel I/O, pario-bib

papadopouli:vbr-streams:
Maria Papadopouli and Leana Golubchik. Support of VBR video streams under disk bandwidth limitations. ACM SIGMETRICS Performance Evaluation Review, 25(3):13-20, December 1997.

Keywords: multimedia, video on demand, parallel I/O, pario-bib

Comment: Part of a special issue on parallel and distributed I/O.

park:2disk:
Chan-Ik Park. Efficient placement of parity and data to tolerate two disk failures in disk array systems. IEEE Transactions on Parallel and Distributed Systems, 6(11):1177-1184, November 1995.

Abstract: In this paper, we deal with the data/parity placement problem which is described as follows: how to place data and parity evenly across disks in order to tolerate two disk failures, given the number of disks N and the redundancy rate p which represents the amount of disk spaces to store parity information. To begin with, we transform the data/parity placement problem into the problem of constructing an N x N matrix such that the matrix will correspond to a solution to the problem. The method to construct a matrix has been proposed and we have shown how our method works through several illustrative examples. It is also shown that any matrix constructed by our proposed method can be mapped into a solution to the placement problem if a certain condition holds between N and p where N is the number of disks and p is a redundancy rate.

Keywords: parallel I/O, disk array, reliability, fault tolerance, pario-bib

park:interface:
Yoonho Park, Ridgway Scott, and Stuart Sechrest. Virtual memory versus file interfaces for large, memory-intensive scientific applications. In Proceedings of Supercomputing '96. ACM Press and IEEE Computer Society Press, November 1996. Also available as UH Department of Computer Science Research Report UH-CH-96-7.

Abstract: Scientific applications often require some strategy for temporary data storage to do the largest possible simulations. The use of virtual memory for temporary data storage has received criticism because of performance problems. However, modern virtual memory found in recent operating systems such as that of the Cenju-3/DE gives application writers control over virtual memory policies. We demonstrate that custom virtual memory policies can dramatically reduce virtual memory overhead and allow applications to run out-of-core efficiently. We also demonstrate that the main advantage of virtual memory, namely programming simplicity, is not lost.

Keywords: virtual memory, file interface, scientific applications, out-of-core, parallel I/O, pario-bib

Comment: Web and CDROM only. They advocate the use of traditional demand-paged virtual memory systems in supporting out-of-core applications. They are implementing an operating system for the NEC Cenju-3/DE, a shared-nothing MIMD multiprocessor with a multistage interconnection network and disks on every node. The operating system is based on Mach, and they have extended Mach to allow user-provided [local] replacement policies. Basically, they argue that you can get good performance as long as you write your own replacement policy (even OPT is possible in certain applications), and that this is easier than (re)writing the application with explicit out-of-core file I/O calls. They measure the performance of two applications on their system, with OPT, FIFO, and a new replacement algorithm customized to one of the applications. They show that they can get much better performance with some replacement policies than with others, but despite the paper's title they do not compare with the performance of an equivalent program using file I/O.
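
For concreteness, OPT (Belady's algorithm) is implementable exactly when the reference string is known ahead of time, which is the situation these custom policies exploit. A minimal sketch:

    def opt_faults(refs, frames):
        """Count page faults under Belady's OPT: on a fault, evict the resident
        page whose next use lies farthest in the future (or never recurs)."""
        resident, faults = set(), 0
        for i, page in enumerate(refs):
            if page in resident:
                continue
            faults += 1
            if len(resident) == frames:
                def next_use(p):
                    for j in range(i + 1, len(refs)):
                        if refs[j] == p:
                            return j
                    return float("inf")           # never referenced again: ideal victim
                resident.remove(max(resident, key=next_use))
            resident.add(page)
        return faults

    print(opt_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], frames=3))  # prints 7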

park:pario:
Arvin Park and K. Balasubramanian. Providing fault tolerance in parallel secondary storage systems. Technical Report CS-TR-057-86, Department of Computer Science, Princeton University, November 1986.

Keywords: parallel I/O, reliability, RAID, pario-bib

Comment: They use ECC with one or more parity drives in bit-interleaved systems, and on-line regeneration of failed drives from spares. More cost-effective than mirrored disks. One of the earliest references to RAID-like concepts. Basically, they describe RAID3.

parsons:complex:
Ian Parsons, Jonathan Schaeffer, Duane Szafron, and Ron Unrau. Using PI/OT to support complex parallel I/O. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, pages 285-291. IEEE Computer Society Press, March 1998.

Keywords: parallel I/O, pario-bib

parsons:templates:
Ian Parsons, Ron Unrau, Jonathan Schaeffer, and Duane Szafron. PI/OT: Parallel I/O templates. Parallel Computing, 23(4):543-570, June 1997.

Abstract: This paper presents a novel, top-down, high-level approach to parallelizing file I/O. Each parallel file descriptor is annotated with a high-level specification, or template, of the expected parallel behaviour. The annotations are external to and independent of the source code. At run-time, all I/O using a parallel file descriptor adheres to the semantics of the selected template. By separating the parallel I/O specifications from the code, a user can quickly change the I/O behaviour without rewriting code. Templates can be composed hierarchically to construct complex access patterns.

Two sample parallel programs using these templates are compared against versions implemented in an existing parallel I/O system (PIOUS). The sample programs show that the use of parallel I/O templates are beneficial from both the performance and software engineering points of view.

Keywords: parallel programming, parallel I/O, pario-bib

Comment: An interesting approach in which they try to separate the description of the parallelism in a file's access from the sequential programming used to access the file. Seems like a good idea. It seems to assume that the programmer was porting an existing sequential code, or prefers to write their parallel program with a sequential frame of mind, including the existing fopen/fread/fwrite stdio interface. They retain the traditional stream-of-bytes file structure. See also parsons:complex.

pase:t3d-fortran:
Douglas M. Pase, Tom MacDonald, and Andrew Meltzer. MPP Fortran programming model. Technical report, Cray Research, Inc., October 11 1993.

Abstract: This report describes the MPP Fortran programming model which will be supported on the first phase MPP systems. Based on existing and proposed standards, it is a work sharing model which combines features from existing models in a way that may be both efficiently implemented and useful.

Keywords: compiler, parallel language, supercomputing, parallel I/O, pario-bib

Comment: See also oed:t3d for T3D overview. I only read the part about I/O. The only I/O support, apparently, is for each processor to open and access the file independently from all other processors.

pasquale:characterization:
Barbara K. Pasquale and George C. Polyzos. Dynamic I/O characterization of I/O intensive scientific applications. In Proceedings of Supercomputing '94, pages 660-669, 1994.

Abstract: Understanding the characteristic I/O behavior of scientific applications is an integral part of the research and development efforts for the improvement of high performance I/O systems. This study focuses on application level I/O behavior with respect to both static and dynamic characteristics. We observed the San Diego Supercomputer Center's Cray C90 workload and isolated the most I/O intensive applications. The combination of a low-level description of physical resource usage and the high-level functional composition of applications and scientific disciplines for this set reveals the major sources of I/O demand in the workload. We selected two applications from the I/O intensive set and performed a detailed analysis of their dynamic I/O behavior. These applications exhibited a high degree of regularity in their I/O activity over time and their characteristic I/O behaviors can be precisely described by one and two, respectively, recurring sequences of data accesses and computation periods.

Keywords: parallel I/O, pario-bib

pasquale:dynamic:
Barbara K. Pasquale and George C. Polyzos. Dynamic I/O characterization of I/O intensive scientific applications. In Proceedings of Supercomputing '94, pages 660-669, Washington, DC, November 1994. IEEE Computer Society Press.

Keywords: scientific computing, file access patterns, I/O, pario-bib

Comment: This paper extends some of their previous results, but the real bottom line here is that some scientific applications do a lot of I/O, the I/O is bursty, and the pattern of bursts is cyclic and regular. Seems like this cyclic nature could be a source of some optimization. Included in the parallel I/O bibliography because it is useful to that community, though they did not trace parallel workload.

pasquale:iowork:
Barbara K. Pasquale and George C. Polyzos. A static analysis of I/O characteristics of scientific applications in a production workload. In Proceedings of Supercomputing '93, pages 388-397, Portland, OR, 1993. IEEE Computer Society Press.

Keywords: scientific computing, file access patterns, pario-bib

Comment: Analyzed one month of accounting records from Cray YMP8/864 in SDSC's production environment. Their base assumption is that scientific application I/O is regular and predictable, e.g., repetitive periodic bursts, with distinct phases, repeating patterns, and sequential access. The goal is to characterize a set of I/O-intensive scientific applications and evaluate regularity of resource usage. They measure volumes and rates of applications and total system. Cumulative and average usage for each distinct non-system application. Most resource usage came from the 5% of applications that were not system applications. ``Virtual I/O rate'' is the bytes transferred per CPU second, which is IMHO only a rough measure because sometimes I/O overlaps CPU time, and sometimes does not. They picked out long-running applications with a high virtual I/O rate. Top 50 applications had 71% of bytes transferred and 10% of CPU time. Of those, 4.66 MB/sec min, 131 MB/sec max. Of those they picked the ones executed most often. Cluster analysis showed only 1-2 clusters. Correlation between I/O and CPU time. Included in the parallel I/O bibliography because it is useful to that community, though they did not trace parallel workload.

pathforward-fs:
Statement of work: SGS file system. ASCI PathForward Program: DOE National Nuclear Security Administration & the DOD National Security Agency, April 2001.

Keywords: design, parallel file system, parallel I/O, pario-bib

Comment: Describes the requirements and desired performance features of a parallel file system designed for the DOE ASCI computers.

patt:iosubsystem:
Yale N. Patt. The I/O subsystem: a candidate for improvement. IEEE Computer, 27(3):15-16, March 1994.

Keywords: I/O, file system, parallel I/O, pario-bib

Comment: This is the intro to a special issue on I/O.

patterson:binformed:
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. Informed prefetching and caching. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 16, pages 224-244. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version patterson:informed.

Keywords: caching, prefetching, file system, hints, I/O, resource management, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of patterson:informed.

patterson:braid:
David Patterson, Garth Gibson, and Randy Katz. A case for redundant arrays of inexpensive disks (RAID). In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 1, pages 3-14. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version patterson:raid.

Keywords: parallel I/O, RAID, reliability, cost analysis, I/O bottleneck, disk array, pario-bib

Comment: Part of jin:io-book; reformatted version of patterson:raid.

patterson:informed:
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. Informed prefetching and caching. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 79-95, Copper Mountain, CO, December 1995. ACM Press.
See also earlier version patterson:informed-tr.
See also later version patterson:binformed.

Abstract: In this paper, we present aggressive, proactive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism, and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies cost-benefit analysis to allocate buffers where they will have the greatest impact. We have implemented informed prefetching and caching in Digital's OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks running a range of applications. Informed prefetching reduces the execution time of text search, scientific visualization, relational database queries, speech recognition, and object linking by 20-83%. Informed caching reduces the execution time of computational physics by up to 42% and contributes to the performance improvement of the object linker and the database. Moreover, applied to multiprogrammed, I/O-intensive workloads, informed prefetching and caching increase overall throughput.

Keywords: caching, prefetching, file system, hints, I/O, resource management, parallel I/O, pario-bib

Comment: See patterson:informed-tr for an earlier version. Programs may give hints to the file system about what they will read in the future, and in what order. Hints are used for informed prefetching and informed caching. Most interesting thing about this paper over the earlier ones is the buffer management. Prefetcher and demand fetcher both want buffers. LRU cache and hinted cache both could supply buffers (thru replacement). Each supplies a cost for giving up buffers and benefit for getting more buffers. These are expressed in a common 'currency', in terms of their expected effect on I/O service time, and a manager takes buffers from one and gives buffers to another when the benefits outweigh the costs. All is based on a simple model, which is further simplified in their implementation within OSF/1. Performance looks good, they can keep more disks busy in a parallel file system. Furthermore, informed caching helps reduce the number of I/Os. Indeed they 'discover' MRU replacement policy automatically.
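
The buffer-trading loop at the heart of that scheme can be modeled abstractly (a toy rendition of the cost-benefit idea, not the OSF/1 code): each consumer reports the marginal benefit, in seconds of saved I/O time, of one more buffer, each supplier reports the marginal cost of losing one, and the manager moves buffers while the best benefit exceeds the lowest cost.

    def rebalance(consumers, suppliers):
        """consumers/suppliers map name -> [value_fn, nbuffers], where value_fn(n)
        gives the marginal benefit/cost in seconds of I/O time at n buffers."""
        moves = []
        while True:
            taker = max(consumers, key=lambda c: consumers[c][0](consumers[c][1] + 1))
            giver = min(suppliers, key=lambda s: suppliers[s][0](suppliers[s][1]))
            benefit = consumers[taker][0](consumers[taker][1] + 1)
            cost = suppliers[giver][0](suppliers[giver][1])
            if benefit <= cost:                   # no remaining trade saves I/O time
                return moves
            consumers[taker][1] += 1
            suppliers[giver][1] -= 1
            moves.append((giver, taker))

    # Toy example with diminishing returns on both sides: the prefetcher takes
    # buffers from the LRU cache until the marginal values meet (three moves).
    consumers = {"prefetch": [lambda n: 4.0 / n, 1]}
    suppliers = {"lru": [lambda n: 4.0 / n, 8]}
    print(rebalance(consumers, suppliers))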

patterson:informed-tr:
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. Informed prefetching and caching. Technical Report CMU-CS-95-134, School of Computer Science, Carnegie Mellon University, 1995.
See also later version patterson:informed.

Abstract: The underutilization of disk parallelism and file cache buffers by traditional file systems induces I/O stall time that degrades the performance of modern microprocessor-based systems. In this paper, we present aggressive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism, and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies a cost-benefit analysis to allocate buffers where they will have the greatest impact. We implemented informed prefetching and caching in DEC's OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks. When running a range of applications including text search, 3D scientific visualization, relational database queries, speech recognition, and computational chemistry, informed prefetching reduces the execution time of four of these applications by 20 to 87%. Informed caching reduces the execution time of the fifth application by up to 30%.

Keywords: caching, prefetching, file system, hints, I/O, resource management, parallel I/O, pario-bib

patterson:latency:
R. H. Patterson, G. A. Gibson, and M. Satyanarayanan. Using transparent informed prefetching to reduce file read latency. In Proceedings of the 1992 NASA Goddard Conference on Mass Storage Systems and Technologies, pages 329-342, September 1992.
See also later version patterson:informed.

Keywords: parallel I/O, file prefetching, file caching, pario-bib

Comment: This 'paper' is really an annotated set of slides.

patterson:pdis-tip:
R. Hugo Patterson and Garth A. Gibson. Exposing I/O concurrency with informed prefetching. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems, pages 7-16, September 1994.
See also later version patterson:informed.

Abstract: Informed prefetching provides a simple mechanism for I/O-intensive, cache-ineffective applications to efficiently exploit highly-parallel I/O subsystems such as disk arrays. This mechanism, dynamic disclosure of future accesses, yields substantial benefits over sequential readahead mechanisms found in current file systems for non-sequential workloads. This paper reports the performance of the Transparent Informed Prefetching system (TIP), a minimal prototype implemented in a Mach 3.0 system with up to four disks. We measured reductions by factors of up to 1.9 and 3.7 in the execution time of two example applications: multi-file text search and scientific data visualization.

Keywords: prefetching, parallel I/O, pario-bib

Comment: Also available in HTML format at http://www.cs.cmu.edu/Web/Groups/PDL/HTML-Papers/PDIS94/final.fm.html.

patterson:raid:
David Patterson, Garth Gibson, and Randy Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 109-116, Chicago, IL, June 1988. ACM Press.
See also later version patterson:braid.

Keywords: parallel I/O, RAID, reliability, cost analysis, I/O bottleneck, disk array, pario-bib

Comment: Make a good case for the upcoming I/O crisis, compare single large expensive disks (SLED) with small cheap disks. Outline five levels of RAID that give different reliabilities, costs, and performances. Block-interleaved with a single check disk (level 4) or with check blocks interspersed (level 5) seem to give best performance for supercomputer I/O or database I/O or both. Note: the TR by the same name (UCB/CSD 87/391) is essentially identical.

patterson:raid2:
David Patterson, Peter Chen, Garth Gibson, and Randy H. Katz. Introduction to redundant arrays of inexpensive disks (RAID). In Proceedings of IEEE Compcon, pages 112-117, Spring 1989.
See also earlier version patterson:raid.

Keywords: parallel I/O, RAID, reliability, cost analysis, I/O bottleneck, disk array, pario-bib

Comment: A short version of patterson:raid, with some slight updates.

patterson:snapmirror:
R. Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara. SnapMirror: File-system-based asynchronous mirroring for disaster recovery. In Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies, pages 117-130, Monterey, CA, January 2002. USENIX Association.

Abstract: Computerized data has become critical to the survival of an enterprise. Companies must have a strategy for recovering their data should a disaster such as a fire destroy the primary data center. Current mechanisms offer data managers a stark choice: rely on affordable tape but risk the loss of a full day of data and face many hours or even days to recover, or have the benefits of a fully synchronized on-line remote mirror, but pay steep costs in both write latency and network bandwidth to maintain the mirror. In this paper, we argue that asynchronous mirroring, in which batches of updates are periodically sent to the remote mirror, can let data managers find a balance between these extremes. First, by eliminating the write latency issue, asynchrony greatly reduces the performance cost of a remote mirror. Second, by storing up batches of writes, asynchronous mirroring can avoid sending deleted or overwritten data and thereby reduce network bandwidth requirements. Data managers can tune the update frequency to trade network bandwidth against the potential loss of more data. We present SnapMirror, an asynchronous mirroring technology that leverages file system snapshots to ensure the consistency of the remote mirror and optimize data transfer. We use traces of production filers to show that even updating an asynchronous mirror every 15 minutes can reduce data transferred by 30% to 80%. We find that exploiting file system knowledge of deletions is critical to achieving any reduction for no-overwrite file systems such as WAFL and LFS. Experiments on a running system show that using file system metadata can reduce the time to identify changed blocks from minutes to seconds compared to purely logical approaches. Finally, we show that using SnapMirror to update every 30 minutes increases the response time of a heavily loaded system only 22%.

Keywords: file systems, pario-bib

patterson:tip:
R. Hugo Patterson, Garth A. Gibson, and M. Satyanarayanan. A status report on research in transparent informed prefetching. ACM Operating Systems Review, 27(2):21-34, April 1993.
See also later version patterson:informed.

Abstract: This paper focuses on extending the power of caching and prefetching to reduce file read latencies by exploiting application level hints about future I/O accesses. We argue that systems that disclose high-level knowledge can transfer optimization information across module boundaries in a manner consistent with sound software engineering principles. Such Transparent Informed Prefetching (TIP) systems provide a technique for converting the high throughput of new technologies such as disk arrays and log-structured file systems into low latency for applications. Our preliminary experiments show that even without a high-throughput I/O subsystem TIP yields reduced execution time of up to 30% for applications obtaining data from a remote file server and up to 13% for applications obtaining data from a single local disk. These experiments indicate that greater performance benefits will be available when TIP is integrated with low level resource management policies and highly parallel I/O subsystems such as disk arrays.

Keywords: file system, prefetching, operating system, pario-bib

Comment: Not much new over previous TIP papers, but does have newer numbers. See patterson:tip1. Also appears in DAGS'93 (patterson:tip2). Previously appeared as TR CMU-CS-93-1.

patterson:tip2:
R. Hugo Patterson, Garth A. Gibson, and M. Satyanarayanan. Informed prefetching: Converting high throughput to low latency. In Proceedings of the 1993 DAGS/PC Symposium, pages 41-55, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.
See also later version patterson:informed.

Abstract: This paper focuses on extending the power of caching and prefetching to reduce file read latencies by exploiting application level hints about future I/O accesses. We argue that systems that disclose high-level knowledge can transfer optimization information across module boundaries in a manner consistent with sound software engineering principles. Such Transparent Informed Prefetching (TIP) systems provide a technique for converting the high throughput of new technologies such as disk arrays and log-structured file systems into low latency for applications. Our preliminary experiments show that even without a high-throughput I/O sub-system TIP yields reduced execution time of up to 30% for applications obtaining data from a remote file server and up to 13% for applications obtaining data from a single local disk. These experiments indicate that greater performance benefits will be available when TIP is integrated with low level resource management policies and highly parallel I/O subsystems such as disk arrays.

Keywords: file system, prefetching, operating system, pario-bib

Comment: Invited speaker: Garth Gibson. Similar paper appeared in ACM OSR April 1993 (patterson:tip).

patterson:vterabytes:
David Patterson. Terabytes $\gg$ teraflops (or why work on processors when I/O is where the action is?). Produced by University Video Communications, 1993. Videotape.

Abstract: RISC pioneer and UC, Berkeley Computer Science Professor David Patterson is working to develop input/output systems to match the increasingly higher performance of new processors. Here he describes the results of the RAID (Redundant Arrays of Inexpensive Disks) project, which offers much greater performance, capacity, and reliability from I/O systems. Patterson also discusses a new project, Sequoia 2000, which looks at utilizing small helical scan tapes, such as digital-audiotapes or videotapes, to offer terabytes of storage for the price of a file/server. He believes that a 1000x increase in storage, available on most Ethernets, will have a much greater impact than a 1000x increase in processing speed.

Keywords: videotape, computer architecture, parallel I/O, pario-bib

Comment: See patterson:trends. 58 minutes.

pawlowski:parsort:
Markus Pawlowski and Rudolf Bayer. Parallel sorting of large data volumes on distributed memory multiprocessors. In Parallel Computer Architectures: Theory, Hardware, Software, Applications, volume 732 of Lecture Notes in Computer Science, pages 246-264, Berlin, 1993. Springer-Verlag.

Keywords: sorting, parallel I/O algorithm, pario-bib

Comment: Main contribution appears to be a new sampling method for initial partition of data set. They approach it from a database point of view.

pearson:sorting:
Matthew D. Pearson. Fast out-of-core sorting on parallel disk systems. Technical Report PCS-TR99-351, Dept. of Computer Science, Dartmouth College, Hanover, NH, June 1999.

Abstract: This paper discusses our implementation of Rajasekaran's (l,m)-mergesort algorithm (LMM) for sorting on parallel disks. LMM is asymptotically optimal for large problems and has the additional advantage of a low constant in its I/O complexity. Our implementation is written in C using the ViC* I/O API for parallel disk systems. We compare the performance of LMM to that of the C library function qsort on a DEC Alpha server. qsort makes a good benchmark because it is fast and performs comparatively well under demand paging. Since qsort fails when the swap disk fills up, we can only compare these algorithms on a limited range of inputs. Still, on most out-of-core problems, our implementation of LMM runs between 1.5 and 1.9 times faster than qsort, with the gap widening with increasing problem size.

Keywords: parallel I/O, out of core, sorting, parallel algorithm, pario-bib

Comment: Undergraduate Honors Thesis. Advisor: Tom Cormen.

perez:allocation:
Jose Maria Perez, Felix Garcia, Jesus Carretero, Alejandro Calderon, and Luis Miguel Sanchez. Data allocation and load balancing for heterogeneous cluster storage systems. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 718-723, Tokyo, May 2003. IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: Distributed filesystems are a typical solution in networked environments such as clusters and grids. Parallel filesystems are a typical solution for achieving a high-performance distributed I/O environment, but those filesystems have some limitations in heterogeneous storage systems. Usually in distributed systems, load balancing is used as a solution to improve performance, but typically the distribution is made between peer computational resources and from the processor point of view. In heterogeneous systems, like heterogeneous clusters of workstations, the existing solutions do not work so well. However, the utilization of those systems is more widespread every day, with an extreme example in the grid environment. In this paper we bring attention to those aspects of heterogeneous distributed data systems, presenting a parallel file system that takes into account the heterogeneity of storage nodes and the dynamic addition of new storage nodes, and an algorithm to group requests in heterogeneous systems.

Keywords: parallel I/O, load balancing, pario-bib

perez:apriori:
M. S. Pérez, R. A. Pons, F. García, J. Carretero, and M. L. Córdoba. An optimization of apriori algorithm through the usage of parallel I/O and hints. In Rough Sets and Current Trends in Computing, number 2475 in Lecture Notes in Computer Science, pages 449-452. Springer-Verlag, October 2002.

Abstract: Association rules are very useful and interesting patterns in many data mining scenarios. The Apriori algorithm is the best-known association rule algorithm. This algorithm interacts with a storage system in order to access input data and output the results. This paper shows how to optimize this algorithm by adapting the underlying storage system to the problem through the use of hints and parallel features.

Keywords: parallel I/O, pario-bib

perez:clfs:
F. Pérez, J. Carretero, P. de Miguel, F. García, and L. Alonso. CLFS design: A parallel file manager for multicomputers. Technical Report FIM/82.1/DATSI/94, Universidad Politécnica de Madrid, Madrid, Spain, 1994.

Abstract: This document describes the detailed design of the CLFS, one of the components of the Cache Coherent File System (CCFS). CCFS has three main components: the Client File Server (CLFS), the Local File Server (LFS), and the Concurrent Disk System (CDS). The Client File Servers are located on each processing node, providing file manager functions on a per-node basis. The CLFS will interact with the LFSs to provide block services, naming, locking, and real input/output, and to manage the disk system, partitions, distributed partitions, etc. The CLFS includes a standard POSIX interface (internally parallelized) and some parallel extensions. It will be responsible for maintaining cache consistency, distributing accesses to servers, providing a file system interface to the user, etc.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

perez:cooperation:
María S. Pérez, Alberto Sánchez, Víctor Robles, José M. Peña, and Jemal Abawajy. Cooperation model of a multiagent parallel file system for clusters. In Proceedings of the Fourth IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 595-601, Chicago, IL, April 2004. IEEE Computer Society Press.

Keywords: multi-agent parallel file system, pario-bib

perez:evaluate:
F. Perez, J. Carretero, L. Alonso, P. De Miguel, and F. Garcia. Evaluating ParFiSys: A high-performance parallel and distributed file system. Journal of Systems Architecture, 43(8):533-542, May 1997.

Abstract: We present an overview of ParFiSys, a coherent parallel file system developed at the UPM to provide I/O services to the GPMIMD machine, an MPP built within the ESPRIT project P-5404. Special emphasis is placed on the results obtained during ParFiSys evaluation. They were obtained using several I/O benchmarks (PARKBENCH, IOBENCH, etc.) and several MPP platforms (T800, T9000, etc.) to cover a broad spectrum of ParFiSys features, and are specifically oriented to measuring throughput for the I/O patterns of scientific applications. ParFiSys is especially well suited to provide I/O services to scientific applications requiring high I/O bandwidth, to minimize application porting effort, and to exploit the parallelism of generic message-passing multicomputers.

Keywords: parallel I/O, multiprocessor file system, pario-bib

perez:gridexpand:
José M. Pérez, Félix Garcia, Jesús Carretero, Alejandro Calderón, and Javier Fernández. A parallel I/O middleware to integrate heterogeneous storage resources on grids. Lecture Notes in Computer Science, 2970:124-131, March 2004.

Abstract: The philosophy behind the grid is to use idle resources to achieve a higher level of computational services (computation, storage, etc). Existing data grid solutions are based on new servers, specific APIs, and protocols; however, this approach is not a realistic solution for enterprises and universities, because it supposes the deployment of new data servers across the company. This paper describes a new approach to data access in computational grids. This approach is called GridExpand, a parallel I/O middleware that integrates heterogeneous data storage resources in grids. The proposed grid solution integrates available data network solutions (NFS, CIFS, WebDAV) and makes possible access to a global grid file system. Our solution differs from others because it does not need the installation of new data servers with new protocols. Most data grid solutions use replication as the way to obtain high performance. Replication, however, introduces consistency problems for many collaborative applications, and sometimes requires the usage of lots of resources. To obtain high performance, we apply the parallel I/O techniques used in parallel file systems.

Keywords: Data Grids, Parallel I/O, data declustering, High performance I/O, pario-bib

Comment: A short paper describing an adaptation of the Expand parallel file system for data grids. Also see the related paper garcia:expand-design.

perez:hints:
María S. Pérez, Albert Sánchez, Víctor Robles, José Peña, and Fernando Pérez. Optimizations based on hints in a parallel file system. Lecture Notes in Computer Science, 3038:347-354, June 2004.

Abstract: Existing parallel file systems provide applications little control for optimizing I/O accesses. Most of these systems use optimization techniques that are transparent to the applications, limiting the performance achieved by these solutions. Furthermore, there is a big gap between the interface provided by parallel file systems and the needs of applications. In fact, most parallel file systems do not use intuitive I/O hints or other optimization approaches. In this sense, application programmers cannot take advantage of optimization techniques suitable for the application domain. This paper describes the I/O optimization techniques used in MAPFS, a multiagent I/O architecture. These techniques are configured by means of a double interface for specifying access patterns or hints that increase the performance of I/O operations. An example of this interface is shown.

Keywords: parallel I/O, optimizations, caching, prefetching, hints, pario-bib

pfister:infiniband:
Gregory F. Pfister. An introduction to the InfiniBand architecture. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 42, pages 617-632. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: parallel I/O architecture, pario-bib

Comment: Part of jin:io-book.

philippsen:triton:
Michael Philippsen, Thomas M. Warschko, Walter F. Tichy, and Christian G. Herter. Project Triton: towards improved programmability of parallel machines. In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume I, pages 192-201, 1993.

Keywords: parallel programming, parallel architecture, parallel I/O, pario-bib

Comment: A language- and application-driven proposal for parallel architecture, that mixes SIMD and MIMD, high-performance networking, large memory, shared address space, and so forth. Fairly convincing arguments. One disk per node. Little mention of a file system though. Email from student Udo Boehm:``We use in the version of Triton/1 with 256 PE's 72 Disks at the moment (the filesystem is scalable up to 256 Disks). These Disks are divided into 8 Groups with 9 Disks. In each group exists one parity disk. Our implementation of the filesystem is an parallel version of RAID Level 3 with some extensions. We use so called vector files for diskaccess. A file is always distributed over all disks of the diskarray. A vectorfile is divided in logical blocks. A logical block exist of 72 physical blocks, each block is on one of the 72 disks and all these 72 physical blocks have the same blocknumber on each disk. A logical block has 18432 Bytes, where 16384 Bytes are for Data. The filesystem uses these logical blocks to save data. We do not use special PE's for the I/O. All PE's can be (are) used to do I/O ! There exists no central which coordinates the PE's.''
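
The layout in the quoted email can be made concrete with a small sketch. The arithmetic is reconstructed from the numbers given there (72 disks in 8 groups of 9, one parity disk per group, 16384 data bytes per logical block, hence 256-byte physical blocks); the function and variable names are illustrative, not from the Triton sources.

    /* Sketch of the quoted Triton/1 layout; arithmetic reconstructed
     * from the email, naming is an assumption.  A logical block is
     * physical block n on every disk, RAID-3 style. */
    #include <stdio.h>

    #define GROUPS      8
    #define DISKS_PER_G 9     /* 8 data + 1 parity per group */
    #define DATA_DISKS  64    /* GROUPS * (DISKS_PER_G - 1)  */
    #define PHYS_BLK    256   /* bytes: 16384 data bytes / 64 data disks */

    /* Map a byte offset in a vector file to (logical block, data disk,
     * group) under a full-width distribution over all disks. */
    static void locate(long offset, long *lblock, int *disk, int *group) {
        long data_per_lblock = (long)DATA_DISKS * PHYS_BLK;  /* 16384 */
        *lblock = offset / data_per_lblock;
        int d = (int)((offset % data_per_lblock) / PHYS_BLK);
        *group = d / (DISKS_PER_G - 1);
        *disk  = d;   /* data disk index, 0..63; parity disks sit apart */
    }

    int main(void) {
        long lb; int d, g;
        locate(40000, &lb, &d, &g);
        printf("offset 40000 -> logical block %ld, data disk %d, group %d\n",
               lb, d, g);
        return 0;
    }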

pierce:pario:
Paul Pierce. A concurrent file system for a highly parallel mass storage system. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 155-160, Monterey, CA, March 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: parallel I/O, hypercube, Intel iPSC/2, multiprocessor file system, pario-bib

Comment: Intel iPSC/2 Concurrent File System. Chose to tailor system for high performance for large files, read in large chunks. Uniform logical file system view, Unix stdio interface. Blocks scattered over all disks, but not striped. Blocksize 4K optimizes message-passing performance without using blocks that are too big. Tree-directory is stored in ONE file and managed by ONE process, so opens are bottlenecked, but that is not their emphasis. File headers, however, are scattered. The file header info contains a list of blocks. File header is managed by disk process on its I/O node. Data caching is done only at the I/O node of the originating disk drive. Read-ahead is used but not detailed here.

pinkenburg:tpo++:
Simon Pinkenburg and Wolfgang Rosenstiel. Parallel I/O in an object-oriented message-passing library. Lecture Notes in Computer Science, 3241:251-258, November 2004.

Abstract: The article describes the design and implementation of parallel I/O in the object-oriented message-passing library TPO++. TPO++ is implemented on top of the message passing standard MPI and provides an object-oriented, type-safe and data centric interface to message-passing. Starting with version 2, the MPI standard defines primitives for parallel I/O called MPI-IO. Based on this layer, we have implemented an object-oriented parallel I/O interface in TPO++. The project is part of our efforts to apply object-oriented methods to the development of parallel physical simulations. We give a short introduction to our message-passing library and detail its extension to parallel I/O. Performance measurements between TPO++ and MPI are compared and discussed.

Keywords: object-oriented message passing, TPO++, parallel I/O interface, pario-bib

piriyakumar:enhanced:
Douglas Antony Louis Piriyakumar, Paul Levi, and Rolf Rabenseifner. Enhanced file interoperability with parallel MPI file-I/O in image processing. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 2474 of Lecture Notes in Computer Science, pages 174-? Springer-Verlag, 2002.

Abstract: One of the crucial problems in image processing is image matching, i.e., to match two images, or in our case, to match a model with the given image. Since this problem is highly computation-intensive, parallel processing is essential to obtain solutions in time under real-world constraints. The Hausdorff method is used to locate human beings in images by matching the image with models, and is parallelized with MPI. The images are usually stored in files with different formats. As most of the formats can be converted into ASCII file format containing integers, we have implemented three strategies, namely Normal File Reading, Off-line Conversion, and Run-time Conversion, for free-format integer file reading and writing. The parallelization strategy is optimized so that I/O overheads are minimal. The relative performances with multiple processors are tabulated for all the cases and discussed. The results obtained demonstrate the efficiency of our strategies, and the implementations will enhance file interoperability, which will be useful for the image processing community in using parallel systems to meet real-time constraints.

Keywords: parallel I/O, multiprocessor file system, pario-bib

poole:sio-survey:
James T. Poole. Preliminary survey of I/O intensive applications. Technical Report CCSF-38, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.

Keywords: parallel I/O, pario-bib, multiprocessor file system, file access pattern, checkpoint

Comment: Goal is to collect a set of representative applications from biology, chemistry, earth science, engineering, graphics, and physics, use performance-monitoring tools to analyze them, create templates and benchmarks that represent them, and then later to evaluate the performance of new I/O tools created by rest of the SIO initiative. Seem to be four categories of I/O needs: input, output, checkpoint, and virtual memory (``out-of-core'' scratch space). Not all types are significant in all applications. (Two groups mention databases and the need to perform computationally complex queries.) Large input is typically raw data (seismic soundings, astronomical observations, satellite remote sensing, weather information). Sometimes there are real-time constraints. Output is often periodic, e.g., the state of the system every few timesteps; typically the volume would increase along with I/O capacity and bandwidth. Checkpointing is a common request; preferably allowing application to choose what and when to checkpoint, and definitely including the state of files. Many kinds of out-of-core: 1) temp files between passes (often written and read sequentially), 2) regular patterns like FFT, matrix transpose, solvers, and single-pass read/compute/write, 3) random access, e.g., to precomputed tables of integrals. Distinct differences in the ways people choose to divide data into files; sometimes all in one huge file, sometimes many ``small'' files (e.g., one per processor, one per timestep, one per region, etc.). Important: overlap of computation and I/O, independent access by individual processors. Not always important: ordering of records read or written by different processors, exposing the I/O model to the application writer. Units of I/O seem to be either (sub)matrices (1-5 dimensions) or items in a collection of objects (100-10000 bytes each). Data sets varied up to 1 TB; bandwidth needs varied up to 1 GB/s. See also bagrodia:sio-character, choudhary:sio-language, bershad:sio-os.

poplawski:simulation:
Anna L. Poplawski and David M. Nicol. An investigation of out-of-core parallel discrete-event simulation. In Proceedings of the Winter Simulation Conference, pages 524-530. IEEE Computer Society Press, December 1999.

Abstract: In large-scale discrete-event simulations the size of a computer's physical memory limits the size of the system to be simulated. Demand paging policies that support virtual memory are generally ineffective. Use of parallel processors to execute the simulation compounds the problems, as memory can be tied down due to synchronization needs. We show that by taking more direct control of disks it is possible to break through the memory bottleneck, without significantly increasing overall execution time. We model one approach to conducting out-of-core parallel simulation, identifying relationships between execution, memory, and I/O costs that admit good performance.

Keywords: discrete-event simulation, parallel computing, out-of-core application, parallel I/O, pario-bib

poston:hpfs:
Alan Poston. A high performance file system for UNIX. In Proceedings of the USENIX Workshop on UNIX and Supercomputers, pages 215-226, 1988.

Keywords: file system, unix, parallel I/O, disk striping, pario-bib

Comment: A new file system for Unix based on striped files. Better performance for sequential access, better for large-file random access, and about the same for small-file random access. Allows full striping, track prefetch, or even volume prefetch. Needs a little bit of buffer management change. Talks about buffer management and parity blocks.

prabhakar:browsing:
Sunil Prabhakar, Divyakant Agrawal, Amr El Abbadi, Ambuj Singh, and Terence Smith. Browsing and placement of multiresolution images on parallel disks. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 102-113, San Jose, CA, November 1997. ACM Press.

Abstract: With rapid advances in computer and communication technologies, there is an increasing demand to build and maintain large image repositories. In order to reduce the demands on I/O and network resources, multiresolution representations are being proposed for the storage organization of images. Image decomposition techniques such as wavelets can be used to provide these multiresolution images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. These visually similar coefficients can be thought of as thumbnails or icons of the original image. This paper addresses the problem of storing these multiresolution coefficients on disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Several strategies are evaluated to store the image coefficients on parallel disks. These strategies can be classified into two broad classes depending on whether the access pattern of the images is used in the placement. Disk simulation is used to evaluate the performance of these strategies. Simulation results are validated with results from experiments with real disks and are found to be in good agreement. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.

Keywords: multimedia, parallel I/O, pario-bib

Comment: They use simulation to study several different placement policies for the thumbnail and varying-resolution versions of images on a disk array.

pratt:twofs:
Terrence W. Pratt, James C. French, Phillip M. Dickens, and Stanley A. Janet, Jr. A comparison of the architecture and performance of two parallel file systems. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 161-166, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: parallel I/O, Intel iPSC/2, nCUBE, pario-bib

Comment: Simple comparison of the iPSC/2 and nCUBE/10 parallel I/O systems. Short description of each system, with simple transfer rate measurements. See also french:ipsc2io-tr.

preslan:gfs:
Kenneth W. Preslan, Andrew P. Barry, Jonathan E. Brassow, Grant M. Erickson, Erling Nygaard, Christopher J. Sabol, Steven R. Soltis, David C. Teigland, and Matthew T. O'Keefe. A 64-bit, shared disk file system for Linux. In Proceedings of the Seventh NASA Goddard Conference on Mass Storage Systems and Technologies, pages 22-41, San Diego, CA, March 1999. IEEE Computer Society Press.

Keywords: Linux, shared file system, network-attached disks, disk striping, parallel I/O, pario-bib

Comment: They discuss a shared, serverless file system for Linux that integrates IP-based network-attached storage and Fibre-Channel-based storage area networks. Based on soltis:gfs.

prost:mpi-io:
Jean-Pierre Prost, Marc Snir, Peter Corbett, and Dror Feitelson. MPI-IO, a message-passing interface for concurrent I/O. Technical Report RC 19712 (87394), IBM T.J. Watson Research Center, August 1994.

Keywords: parallel I/O, message-passing, multiprocessor file system interface, pario-bib

Comment: See newer version mpi-ioc:mpi-io5.

rab:raidbook:
The RAIDBook: A source book for RAID technology. The RAID Advisory Board, Lino Lakes, MN, June 9 1993. First Edition.

Keywords: RAID, disk array, parallel I/O, pario-bib

Comment: Basically, an educational piece about the basics of RAID technology. Helps to define terms across the industry. Written by the RAID advisory board, which is an industry consortium. Overviews RAID, RAID levels, non-Berkeley RAID levels. List of Board members. Bibliography.

rabenseifner:benchmark:
Rolf Rabenseifner, Alice E. Koniges, Jean-Pierre Prost, and Richard Hedges. The parallel effective I/O bandwidth benchmark: b_eff_io. In Christophe Cerin and Hai Jin, editors, Parallel I/O for Cluster Computing, chapter 4, pages 107-132. Kogan Page Ltd., February 2004.

Keywords: parallel I/O benchmarks, MPI-IO, pario-bib

rajaram:thesis:
Kumaran Rajaram. Principal design criteria influencing the performance of a portable, high performance parallel I/O implementation. Master's thesis, Department of Computer Science, Mississippi State University, May 2002.

Keywords: MPI-IO, MPI, parallel I/O, pario-bib

rajasekaran:out-of-core:
Sanguthevar Rajasekaran. Out-of-core computing on mesh connected computers. Journal of Parallel and Distributed Computing, 64(11):1311-1317, November 2004.

Abstract: Several models of parallel disks are found in the literature. These models have been proposed to alleviate the I/O bottleneck arising in handling voluminous data. These models have the general theme of assuming multiple disks. For instance, the parallel disks model assumes D disks and a single computer. It is also assumed that a block of data from each of the D disks can be fetched into the main memory in one parallel I/O operation. In this paper, we study a model where there is more than one processor and each processor has an associated disk. In addition to the I/O cost, one also has to account for the inter-processor communication costs. To begin with, we study the mesh and investigate its performance with respect to out-of-core computing. As a case study we consider the problem of sorting. The goal of this paper is to study the properties of this model.

Keywords: out-of-core, sorting, parallel disk model, performance analysis, pario-bib
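
The parallel disk model (PDM) accounting sketched in this abstract is easy to state in code. The following toy function counts the parallel I/Os needed to scan N records striped over D disks in blocks of B records; the names follow the usual PDM convention, and nothing here is from the paper itself.

    /* Minimal sketch of parallel-disk-model accounting: one block per
     * disk is transferable per parallel I/O, so a scan of all blocks
     * costs ceil(blocks / D) parallel I/Os. */
    #include <stdio.h>

    static long pdm_scan_ios(long N, long D, long B) {
        long blocks = (N + B - 1) / B;   /* total blocks on disk */
        return (blocks + D - 1) / D;     /* one block per disk per I/O */
    }

    int main(void) {
        /* Reading 10^6 records from D=8 disks with B=1024-record blocks
         * takes ceil(977 / 8) = 123 parallel I/Os. */
        printf("%ld parallel I/Os\n", pdm_scan_ios(1000000L, 8L, 1024L));
        return 0;
    }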

rajasekaran:selection:
Sanguthevar Rajasekaran. Selection algorithms for parallel disk systems. Journal of Parallel and Distributed Computing, 61(4):536-544, April 2001.

Abstract: With the widening gap between processor speeds and disk access speeds, the I/O bottleneck has become critical. Parallel disk systems have been introduced to alleviate this bottleneck. In this paper we present deterministic and randomized selection algorithms for parallel disk systems. The algorithms to be presented, in addition to being asymptotically optimal, have small underlying constants in their time bounds and hence have the potential of being practical.

Keywords: I/O algorithms, parallel I/O, pario-bib

ramachandran:msthesis:
Harish Ramachandran. Design and implementation of the system interface for PVFS2. Master's thesis, Clemson University, December 2002.

Abstract: As Linux clusters emerged as an alternative to traditional supercomputers, one of the problems faced was the absence of a high-performance parallel file system comparable to the file systems on the commercial machines. The Parallel Virtual File System (PVFS), developed at Clemson University, has attempted to address this issue. PVFS is a parallel file system currently used in parallel I/O research and as a parallel file system on Linux clusters running high-performance parallel applications.

An important component of parallel file systems is the file system interface, which has different requirements compared to the normal UNIX interface, particularly the I/O interface. A parallel I/O interface is required to provide support for non-contiguous access patterns, collective I/O, and large file sizes in order to achieve good performance with parallel applications. As it supports significantly different functionality, the interface exposed by a parallel file system assumes importance. So, the file system needs to either directly provide a parallel I/O interface or at least support such an interface being implemented on top.

The PVFS2 System Interface is the native file system interface for PVFS2 - the next generation of PVFS. The System Interface provides support for multiple interfaces such as a POSIX interface or a parallel I/O interface like MPI-IO to access PVFS2 while also allowing the benefits of abstraction by decoupling the System Interface from the actual file system implementation. This document discusses the design and implementation of the System Interface for PVFS2.

Keywords: pvfs, parallel file system, system interface, pario-bib

rauch:partitioncast:
Felix Rauch, Christian Kurmann, and Thomas M. Stricker. Partition cast - modelling and optimizing the distribution of large data sets in PC clusters. In Proceedings of the Sixth International Euro-Par Conference, volume 1900 of Lecture Notes in Computer Science, pages 1118-1131, Munich, August 2000. Springer-Verlag.

Abstract: Multicasting large amounts of data efficiently to all nodes of a PC cluster is an important operation. In the form of a partition cast it can be used to replicate entire software installations by cloning. Optimizing a partition cast for a given cluster of PCs reveals some interesting architectural tradeoffs, since the fastest solution does not only depend on the network speed and topology, but remains highly sensitive to other resources like the disk speed, the memory system performance and the processing power in the participating nodes. We present an analytical model that guides an implementation towards an optimal configuration for any given PC cluster. The model is validated by measurements on our cluster using Gigabit- and Fast Ethernet links. The resulting simple software tool, Dolly, can replicate an entire 2 GByte Windows NT image onto 24 machines in less than 5 minutes.

Keywords: multicast, network, cluster, parallel I/O, pario-bib
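
The analytical model mentioned in the abstract treats replication throughput as bottlenecked by the slowest resource a data stream crosses. Here is a toy version of that idea, with invented resource numbers; the paper's actual model also accounts for the stream topology (star, tree, multi-drop chain).

    /* Toy bottleneck model: sustainable throughput of a node that must
     * receive, store, and forward data is the minimum of the resources
     * the stream passes through.  All numbers are invented. */
    #include <stdio.h>

    static double min2(double a, double b) { return a < b ? a : b; }

    static double node_throughput(double net_in, double disk, double mem,
                                  double net_out) {
        double t = min2(net_in, disk);
        t = min2(t, mem / 3.0);   /* assume data crosses the memory bus ~3x */
        return min2(t, net_out);
    }

    int main(void) {
        /* Fast Ethernet chain: 12.5 MB/s links, 20 MB/s disks, 300 MB/s
         * memory; the chain runs at the slowest per-node rate. */
        printf("chain throughput: %.1f MB/s\n",
               node_throughput(12.5, 20.0, 300.0, 12.5));
        return 0;
    }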

rauch:thesis:
Felix Rauch. Distribution and Storage of Data on Local and Remote Disks in Multi-Use Clusters of PCs. PhD thesis, Dept. of Computer Science, Swiss Federal Institute of Technology (ETH Zurich), Zurich, Switzerland, 2003. Full online publication to follow.

Abstract: Over the last few decades, the power of personal computers (PCs) has grown steadily, following the exponential growth rate predicted by Moore's law. The trend towards the commoditization of PC components (such as CPUs, memories, high-speed interconnects and disks) results in a highly attractive price/performance ratio of the systems built from those components. Following these trends, I propose to integrate the commodity IT resources of an entire company or organization into multi-use clusters of commodity PCs. These include compute farms, experimental clusters as well as desktop PCs in offices and labs. This thesis follows a bottom-up architectural approach and deals with hardware and system-software architecture with a tight focus on performance and efficiency. In contrast, the Grid view of providing services instead of hardware for storage and computation deals mostly with problems of capability, service and security rather than performance and modelling thereof.

Multi-use clusters of commodity PCs have by far enough storage on their hard-disk drives for the required local operating-system (OS) installation, and therefore there is a lot of excess storage in a multi-use cluster. This additional disk space on the nodes should be put to better use for a variety of interesting applications, e.g.\ for on-line analytic data processing (OLAP). The specific contributions of the thesis include solutions to four important problems of optimized resource usage in multi-use-cluster environments.

Analytic models of computer systems are important to understand the performance of current systems and to predict the performance of future systems early in the design stage. The thesis introduces a simple analytic model of data streams in clusters. The model considers the topology of data streams as well as the limitations of the edges and nodes. It also takes into account the limitations of the resources within the nodes, which are passed through by the data streams.

Using the model, the thesis evaluates different data-casting techniques that can be used to replicate OS installations to many nodes in clusters. The different implementations based on IP multicast, star-, tree- and multi-drop-chain topologies are evaluated with the analytic model as well as with experimental measurements. As a result of the evaluation, the multi-drop chain is proposed as most suitable replication technique.

When working with multi-use clusters, we noticed that maintenance of the highly replicated system software is difficult, because there are many OS installations in different versions and customisations. Since it is desirable to backup all older versions and customisations of all OS installations, I implemented several techniques to archive the large amounts of highly redundant data contained in the nodes' OS partitions. The techniques take different approaches of comparing the data, but are all OS independent and work with whole partition images. The block repositories that store only unique data blocks prove to be an efficient data storage for OS installations in multi-use clusters.

Finally we look at the possibilities to take advantage of the excess storage on the many nodes' hard-disk drives. The thesis investigates several ways to gather data from multiple server nodes to a client node running the applications. The combined storage can be used for data-warehousing applications. While powerful multi-CPU ``killer workstations'' with redundant arrays of inexpensive disks (RAIDs) are the current workhorses for data warehousing because of their compatibility with standard databases, they are still expensive compared to multi-use clusters of commodity PCs. On the other end several researchers in databases have tried to find domain specific solutions using middleware. My thesis looks at the question whether, and to what extent, the cost-efficient multi-use clusters of commodity PCs can provide an alternative data-warehousing platform with an OS solution that is transparent enough to run a commodity database system. To answer the question about the most suitable software layer for a possible implementation, the thesis compares different distributed file systems and distributed-device systems against the middleware solution that uses database-internal communication for distributing partial queries. The different approaches are modelled with the analytic model and evaluated with a microbenchmark as well as the TPC-D decision-support benchmark.

Given the existing systems and software packages, it looks like the domain-specific middleware approach delivers the best performance, and in the area of the transparent OS-only solution, distributed devices are faster than the more complex distributed file systems and achieve performance similar to a system with local disks only.

Keywords: Cluster of PCs, commodity computing, data streams, multicast, cloning, data storage, distributed file systems, distributed devices, network-attached disks, OS image distribution, pario-bib

Comment: See also rauch:partitioncast

reddy:compiler:
A. L. Narasimha Reddy, P. Banerjee, and D. K. Chen. Compiler support for parallel I/O operations. In Proceedings of the 1991 International Conference on Parallel Processing, pages II:290-II:291, St. Charles, IL, 1991. CRC Press.
See also earlier version reddy:compiler-tr.

Keywords: parallel I/O, pario-bib, compilers

Comment: This version is only 2 pages. reddy:compiler-tr provides the full text. They discuss three primary issues. 1) Overlapping I/O with computation: the compiler's dependency analysis is used to decide when some I/O may be moved up and performed asynchronously with other computation. 2) Parallel execution of I/O statements: if all sizes are known at compile time, the compiler can insert seeks so that processes can access the file independently. When writing in the presence of conditionals they even propose skipping by the maximum and leaving holes in the file, and they claim that this doesn't hurt (!). 3) Parallel format conversion: again, if there are fixed-width fields the compiler can have processors seek to different locations, read data independently, and do format conversion in parallel. Really all this is saying is that fixed-width fields are good for parallelism, and that compilers could take advantage of them.
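
Points 2 and 3 of this comment rest on a simple observation: with fixed-width records, every process can compute its own file offsets and read independently. A hedged sketch of that idea follows; the file name, record size, and the cyclic record-to-process assignment are all illustrative, not from the paper.

    /* Sketch: each of 'nprocs' processes reads its share of a file of
     * fixed-width records without coordination, by computing offsets. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define RECSIZE 64   /* assumed fixed-width record, in bytes */

    /* Process 'rank' of 'nprocs' reads records rank, rank+nprocs, ... */
    static void read_my_records(const char *path, int rank, int nprocs) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); exit(1); }
        char rec[RECSIZE];
        off_t len = lseek(fd, 0, SEEK_END);
        for (long r = rank; (off_t)r * RECSIZE < len; r += nprocs) {
            if (pread(fd, rec, RECSIZE, (off_t)r * RECSIZE) != RECSIZE)
                break;
            /* ... convert/consume record r in parallel with the others ... */
        }
        close(fd);
    }

    int main(int argc, char **argv) {
        if (argc == 4)
            read_my_records(argv[1], atoi(argv[2]), atoi(argv[3]));
        return 0;
    }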

reddy:compiler-tr:
A. L. Narasimha Reddy, P. Banerjee, and D. K. Chen. Compiler support for parallel I/O operations. Technical report, IBM Yorktown Heights, 1991. Also appeared in ICPP '91.
See also later version reddy:compiler.

Keywords: parallel I/O, pario-bib, compilers

reddy:hyperio1:
A. L. Reddy, P. Banerjee, and Santosh G. Abraham. I/O embedding in hypercubes. In Proceedings of the 1988 International Conference on Parallel Processing, volume 1, pages 331-338, St. Charles, IL, 1988. Pennsylvania State Univ. Press.
See also later version reddy:hyperio3.

Keywords: parallel I/O, hypercube, pario-bib

Comment: Emphasis is on adjacency. It also implies (and they assume) that data is distributed well across the disks so no data needs to move beyond the neighbors of an I/O node. Still, the idea of adjacency is good since it allows for good data distribution while not requiring it, and for balancing I/O procs among procs in a good way. Also avoids messing up the hypercube regularity with (embedded) dedicated I/O nodes.

reddy:hyperio2:
A. L. Reddy and P. Banerjee. I/O issues for hypercubes. In ACM International Conference on Supercomputing, pages 72-81, 1989.
See also later version reddy:hyperio3.

Keywords: parallel I/O, hypercube, pario-bib

Comment: See reddy:hyperio3 for extended version.

reddy:hyperio3:
A. L. Narasimha Reddy and Prithviraj Banerjee. Design, analysis, and simulation of I/O architectures for hypercube multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(2):140-151, April 1990.
See also earlier version reddy:hyperio1.

Keywords: parallel I/O, hypercube, pario-bib

Comment: An overall paper restating their embedding technique from reddy:hyperio1, plus a little bit of evaluation along the lines of reddy:pario2, plus some ideas about matrix layout on the disks. They claim that declustering is important, since synchronized disks do not provide enough parallelism, especially in the communication across the hypercube (since the synchronized disks must hang off one node).

reddy:pario:
A. Reddy and P. Banerjee. An evaluation of multiple-disk I/O systems. In Proceedings of the 1989 International Conference on Parallel Processing, pages I:315-322, St. Charles, IL, 1989. Pennsylvania State Univ. Press.
See also later version reddy:pario2.

Keywords: parallel I/O, disk array, disk striping, pario-bib

Comment: See also expanded version reddy:pario2.

reddy:pario2:
A. Reddy and P. Banerjee. Evaluation of multiple-disk I/O systems. IEEE Transactions on Computers, 38:1680-1690, December 1989.
See also earlier version reddy:pario.
See also later version reddy:pario3.

Keywords: parallel I/O, disk array, disk striping, pario-bib

Comment: Compares declustered disks (sort of MIMD-like) to synchronized-interleaved (SIMD-like). Declustering needed for scalability, and is better for scientific workloads. Handles large parallelism needed for scientific workloads and for RAID-like architectures. Synchronized interleaving is better for general file system workloads due to better utilization and reduction of seek overhead.

reddy:pario3:
A. L. Reddy and Prithviraj Banerjee. A study of parallel disk organizations. Computer Architecture News, 17(5):40-47, September 1989.
See also earlier version reddy:pario2.

Keywords: parallel I/O, disk array, disk striping, pario-bib

Comment: Nothing new over the expanded version reddy:pario2; little different from reddy:pario.

reddy:perfectio:
A. L. Narasimha Reddy and Prithviraj Banerjee. A study of I/O behavior of Perfect benchmarks on a multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 312-321, 1990.

Keywords: parallel I/O, file access pattern, workload, multiprocessor file system, benchmark, pario-bib

Comment: Using five applications from the Perfect benchmark suite, they studied both implicit (paging) and explicit (file) I/O activity. They found that the paging activity was relatively small and that sequential access to VM was common. All access to files was sequential, though this may be due to the programmer's belief that the file system is sequential. Buffered I/O would help to make transfers bigger and more efficient, but there wasn't enough rereferencing to make caching useful.

reddy:thesis:
Narasimha Reddy L. Annapareddy. Parallel Input/Output Architectures for Multiprocessors. PhD thesis, University of Illinois at Urbana-Champaign, August 1990. Available as technical report UILU-ENG-90-2235 or CRHC-90-5.

Keywords: parallel I/O, multiprocessor architecture, pario-bib

Comment: Much of the material in this thesis has been published in other papers, i.e., reddy:io, reddy:notsame, reddy:hyperio1, reddy:hyperio2, reddy:hyperio3, reddy:pario, reddy:pario2, reddy:pario3, reddy:perfectio, reddy:mmio. He traces some ``Perfect'' benchmarks to determine paging and file access patterns. He simulates a variety of declustered, synchronized, and synchronized-declustered striping configurations under both ``file'' and ``scientific'' workloads to determine which is best. He proposes embeddings for I/O nodes in hypercubes, where the I/O nodes are just like regular nodes but with an additional I/O processor and disk(s). He studies the disk configurations again, when embedded in hypercubes. He proposes ways to lay out matrices (in blocked form) across disks in a hypercube. He proposes a new parity-based fault-tolerance scheme that prevents overloading during failure-mode access. And he considers compiler issues: overlapping I/O with computation, parallelizing I/O statements, and parallel format conversion.

reed:panel:
Daniel A. Reed, Charles Catlett, Alok Choudhary, David Kotz, and Marc Snir. Parallel I/O: Getting ready for prime time. IEEE Parallel and Distributed Technology, pages 64-71, Summer 1995. Edited transcript of panel discussion at the 1994 International Conference on Parallel Processing.

Keywords: parallel I/O, pario-bib, dfk

Comment: This paper summarizes the presentations made by panel members at the ICPP panel discussion on parallel I/O, and the ensuing discussion.

reed:sio-book:
Daniel A. Reed, editor. Scalable Input/Output: Achieving System Balance. Scientific and Engineering Computation. MIT Press, October 2003.

Keywords: I/O characterization, checkpointing, collective I/O, parallel database, I/O optimization, pario-bib

rettberg:monarch:
Randall D. Rettberg, William R. Crowther, Philip P. Carvey, and Raymond S. Tomlinson. The Monarch Parallel Processor hardware design. IEEE Computer, 23(4):18-30, April 1990.

Keywords: MIMD, parallel architecture, shared memory, parallel I/O, pario-bib

Comment: This describes the Monarch computer from BBN. It was never built. 65K processors and memory modules. 65GB RAM. Bfly-style switch in dance-hall layout. Switch is synchronous; one switch time is a frame (one microsecond, equal to 3 processor cycles) and all processors may reference memory in one frame time. Local I-cache only. Contention reduces full bandwidth by 16 percent. Full 64-bit machine. Custom VLSI. Each memory location has 8 tag bits. One allows for a location to be locked by a processor. Thus, any FetchAndOp or full/empty model can be supported. I/O is done by adding I/O processors (up to 2K in a 65K-proc machine) in the switch. They plan 200 disks, each with an I/O processor, for 65K nodes. They would spread each block over 9 disks, including one for parity (essentially RAID).

riedel:active-mining:
Erik Riedel, Garth Gibson, and Christos Faloutsos. Active storage for large-scale data mining and multimedia. In A. Gupta, O. Shmueli, and J. Widom, editors, 24th Annual International Conference on Very Large Data Bases (VLDB'98), pages 62-73, New York, NY, August 1998. Morgan Kaufmann Publishers Inc.

Abstract: The increasing performance and decreasing cost of processors and memory are causing system intelligence to move into peripherals from the CPU. Storage system designers are using this trend toward "excess" compute power to perform more complex processing and optimizations inside storage devices. To date, such optimizations have been at relatively low levels of the storage protocol. At the same time, trends in storage density, mechanics, and electronics are eliminating the bottleneck in moving data off the media and putting pressure on interconnects and host processors to move data more efficiently. We propose a system called Active Disks that takes advantage of processing power on individual disk drives to run application-level code. Moving portions of an application's processing to execute directly at disk drives can dramatically reduce data traffic and take advantage of the storage parallelism already present in large systems today. We discuss several types of applications that would benefit from this capability with a focus on the areas of database, data mining, and multimedia. We develop an analytical model of the speedups possible for scan-intensive applications in an Active Disk system. We also experiment with a prototype Active Disk system using relatively low-powered processors in comparison to a database server system with a single, fast processor. Our experiments validate the intuition in our model and demonstrate speedups of 2x on 10 disks across four scan-based applications. The model promises linear speedups in disk arrays of hundreds of disks, provided the application data is large enough.

Keywords: active disks, active storage, application level code, database server, data mining, pario-bib

riedel:thesis:
Erik Riedel. Active Disks - Remote Execution for Network-Attached Storage. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, November 1999.

Abstract: Today's commodity disk drives, the basic unit of storage for computer systems large and small, are actually small computers, with a processor, memory, and 'network' connection, along with the spinning magnetic material that permanently stores the data. As more and more of the information in the world becomes digitally available, and more and more of our daily activities are recorded and stored, people are increasingly finding value in analyzing, rather than simply storing and forgetting, these large masses of data. Sadly, advances in I/O performance have lagged the development of commodity processor and memory technology, putting pressure on systems to deliver data fast enough for these types of data-intensive analysis. This dissertation proposes a system called Active Disks that takes advantage of the processing power on individual disk drives to run application-level code. Moving portions of an application's processing directly to the disk drives can dramatically reduce data traffic and take advantage of the parallelism already present in large storage systems. It provides a new point of leverage to overcome the I/O bottleneck.

This dissertation presents the factors that will make Active Disks a reality in the not-so-distant future, the characteristics of applications that will benefit from this technology, an analysis of the improved performance and efficiency of systems built around Active Disks, and a discussion of some of the optimizations that are possible with more knowledge available directly at the devices. It also compares this work with previous work on database machines and examines the opportunities that allow us to take advantage of these promises today where previous approaches have not succeeded. The analysis is motivated by a set of applications from data mining, multimedia, and databases and is performed in the context of a prototype Active Disk system that shows dramatic speedups over a system with traditional, "dumb" disks.

Keywords: storage, active disks, embedded systems, architecture, databases, data mining, disk scheduling, pario-bib

riesen:experience:
Rolf Riesen, Arthur B. Maccabe, and Stephen R. Wheat. Experience in implementing a parallel file system. Available for ftp?, March 1993.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: They describe their experience building a file system for SUNMOS. Paper describes tuning the SCSI device, their striping strategy, their message-passing tricks, and some performance results.

rochberg:ctip:
David Rochberg and Garth Gibson. Prefetching over a network: Early experience with CTIP. ACM SIGMETRICS Performance Evaluation Review, 25(3):29-36, December 1997.

Keywords: file prefetching, distributed file system, parallel I/O, pario-bib

Comment: Part of a special issue on parallel and distributed I/O.

rodriguez:nnt:
Bernardo Rodriguez, Leslie Hart, and Tom Henderson. Programming regular grid-based weather simulation models for portable and fast execution. In Proceedings of the 1995 International Conference on Parallel Processing, pages III:51-59, St. Charles, IL, August 1995. CRC Press.

Keywords: weather simulation, scientific application, parallel I/O, pario-bib

Comment: Related to hart:grid.

rosales:cds:
F. Rosales, J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. CDS design: A parallel disk server for multicomputers. Technical Report FIM/83.1/DATSI/94, Universidad Politécnica de Madrid, Madrid, Spain, 1994.

Abstract: This document describes the detailed design of the CDS, one of the components of the Cache Coherent File System (CCFS). CCFS has three main components: the Client File Server (CLFS), the Local File Server (LFS), and the Concurrent Disk System (CDS). A CDS is located on each disk node, performing input/output functions on a per-node basis. The CDS will interact with the microkernel drivers to execute real input/output and to manage the disk system. The CDS includes general services to distribute accesses to disks, control partition information, etc.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

rosti:impact:
Emilia Rosti, Giuseppe Serazzi, Evgenia Smirni, and Mark S. Squillante. The impact of I/O on program behavior and parallel scheduling. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 56-65. ACM Press, June 1998.

Keywords: CPU scheduling, disk scheduling, I/O model, parallel I/O, pario-bib

rothnie:ksr:
James Rothnie. Kendall Square Research: introduction to the KSR1. In Proceedings of the 1992 DAGS/PC Symposium, pages 200-210, Hanover, NH, June 23-27 1992. Dartmouth Institute for Advanced Graduate Studies.

Keywords: parallel architecture, shared memory, MIMD, interconnection network, parallel I/O, memory-mapped files, pario-bib

Comment: Overview of the KSR1.

roy:unixfile:
Paul J. Roy. Unix file access and caching in a multicomputer environment. In Proceedings of the Usenix {Mach III} Symposium, pages 21-37, 1993.

Keywords: multiprocessor file system, Unix, Mach, memory mapped file, pario-bib

Comment: Describes the modifications to the OSF/1 AD file system for a multicomputer environment. Goal is for normal Unix files, not supercomputer access. The big thing was separation of the caching from backing store management, by pulling out the cache management into the Extended Memory Management (XMM) subsystem. Normally OSF/1 maps files to Mach memory objects, which are then accessed (through read() and write()) using bcopy(). XMM makes it possible to access these memory objects from any node in the system, providing coherent compute-node caching of pages from the memory object. It uses tokens controlled by the XMM server at the file's server node to support a single-reader, single-writer policy on the whole file, but migrating page by page. They plan to extend to multiple writers, but atomicity constraints on the file pointer and metadata make it difficult. Files are NOT striped across file servers or I/O nodes. Several hacks were necessary to work around Mach interface problems. Unix buffer caching is abandoned. Future includes supercomputer support in the form of turning off all caching. No performance evaluation included. See zajcew:osf1.

rullman:interface:
Brad Rullman and David Payne. An efficient file I/O interface for parallel applications. DRAFT presented at the Workshop on Scalable I/O, Frontiers '95, February 1995.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: They believe that the API should be Unix-compatible, systems must support scalable performance on large transfers of data, and that systems must support very large files. Most of the paper is specifics about the Paragon PFS interface, which has many features not mentioned in earlier PFS papers. Contact brad@ssd.intel.com or payne@ssd.intel.com.

ryan:cfs:
Steve Ryan. CFS workload demonstration code. WWW ftp://ftp.cs.dartmouth.edu/pub/pario/examples/CFS3D.tar.Z, July 1991. A simple program demonstrating CFS usage for ARC3D-like applications.

Keywords: parallel I/O workload, file access pattern, Intel, pario-bib

Comment: A sample code that tries to behave like a parallel ARC3D in terms of its output. It writes two files, one containing three three-dimensional matrices X, Y, and Z, and the other containing the four-dimensional matrix Q. The matrices are spread over all the nodes, and each file is written in parallel by the processors. See also ryan:navier.

ryan:navier:
J. S. Ryan and S. K. Weeratunga. Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles. In 31st Aerospace Sciences Meeting and Exhibit, Reno, NV, 1993. AIAA Paper 93-0064.

Keywords: parallel application, CFD, parallel I/O, pario-bib

Comment: This paper goes with the ryan:cfs code example. Describes their parallel implementation of the ARC3D code on the iPSC/860. A section of the paper considers I/O, which is to write out a large multidimensional matrix at each timestep. They found that it was actually faster to write to separate files, because congestion at the I/O nodes was hurting performance. Even so, they never got more than 2 MB/s on a system that should obtain 7-10 MB/s peak.

salem:diskstripe:
Kenneth Salem and Hector Garcia-Molina. Disk striping. In Proceedings of the IEEE 1986 Conference on Data Engineering, pages 336-342, 1986.
See also earlier version salem:striping.

Keywords: parallel I/O, disk striping, disk array, pario-bib

Comment: See the techreport salem:striping for a nearly identical but more detailed version.

salem:striping:
Kenneth Salem and Hector Garcia-Molina. Disk striping. Technical Report 332, EECS Dept., Princeton Univ., December 1984.
See also later version salem:diskstripe.

Keywords: parallel I/O, disk striping, disk array, pario-bib

Comment: Cite salem:diskstripe instead. Basic paper on striping, for a uniprocessor, single-user machine. Interleaving is asynchronous, even without matching disk locations, though this is discussed. All done with models.
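
For readers new to striping, a minimal sketch of the round-robin block mapping that papers like this analyze (my illustration, not code from the report):

    def stripe_map(logical_block, num_disks):
        """Block-interleaved striping: logical block i lives on disk
        i mod D, at block offset i div D within that disk."""
        disk = logical_block % num_disks
        offset = logical_block // num_disks
        return disk, offset

    # Logical blocks 0..7 on 4 disks fall on disks 0,1,2,3,0,1,2,3.
    print([stripe_map(i, 4) for i in range(8)])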

salmon:cubix:
John Salmon. CUBIX: Programming hypercubes without programming hosts. In Proceedings of the Second Conference on Hypercube Multiprocessors, pages 3-9, 1986.

Keywords: hypercube, multiprocessor file system interface, pario-bib

Comment: Previously, hypercubes were programmed as a combination of host and node programs. Salmon proposes to use a universal host program that acts essentially as a file server, responding to requests from the node programs. Two modes: crystalline, where node programs run in loose synchrony, and amorphous, where node programs are asynchronous. In the crystalline case, files have a single file pointer and are either single- or multiple- access; single access means all nodes must simultaneously issue the same request; multiple access means they all simultaneously issue the same request with different parameters, giving an interleaved pattern. Amorphous allows asynchronous activity, with separate file pointers per node.
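
A toy model may make the two crystalline modes concrete. The function names and checks below are hypothetical, not Salmon's host-program interface:

    # Hypothetical host-side handling of CUBIX crystalline reads.
    def single_access_read(requests):
        """Single access: all nodes must issue the identical request,
        and every node receives the same data."""
        assert all(r == requests[0] for r in requests), "nodes out of sync"
        size = requests[0]
        return [f"<{size} bytes>" for _ in requests]

    def multiple_access_read(requests):
        """Multiple access: all nodes issue the same request with
        different parameters, giving an interleaved pattern; each node
        receives its own slice of the file."""
        out, offset = [], 0
        for size in requests:             # served in node order
            out.append(f"<bytes {offset}..{offset + size - 1}>")
            offset += size
        return out

    print(single_access_read([128, 128, 128, 128]))
    print(multiple_access_read([100, 200, 50, 150]))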

salmon:nbody:
John Salmon and Michael Warren. Parallel out-of-core methods for N-body simulation. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

Abstract: Hierarchical treecodes have, to a large extent, converted the compute-bound N-body problem into a memory-bound problem. The large ratio of DRAM to disk pricing suggests use of out-of-core techniques to overcome memory capacity limitations. We will describe a parallel, out-of-core treecode library, targeted at machines with independent secondary storage associated with each processor. Borrowing the space-filling curve techniques from our in-core library, and ``manually'' paging, results in excellent spatial and temporal locality and very good performance.

Keywords: parallel I/O, out of core applications, scientific computing, pario-bib

Comment: Only published on CD-ROM

sanders:async:
Peter Sanders. Asynchronous scheduling of redundant disk arrays. In Twelfth ACM Symposium on Parallel Algorithms and Architectures, pages 89-98, Bar Harbor, ME, USA, July 2000. ACM Press.
See also later version sanders:jasync.

Abstract: Random redundant allocation of data to parallel disk arrays can be exploited to achieve low access delays. New algorithms are proposed which improve the previously known shortest queue algorithm by systematically exploiting that scheduling decisions can be deferred until a block access is actually started on a disk. These algorithms are also generalized for coding schemes with low redundancy. Using extensive experiments, practically important quantities are measured which have so far eluded an analytical treatment: the delay distribution when a stream of requests approaches the limit of the system capacity, the system efficiency for parallel disk applications with bounded prefetching buffers, and the combination of both for mixed traffic. A further step towards practice is taken by outlining the system design for alpha, an automatically load-balanced parallel hard-disk array.

Keywords: parallel disks, lazy scheduling, random redundant storage, I/O algorithm, random block placement, bipartite matching, pario-bib

Comment: Also see later version sanders:jasync.

sanders:datatypes:
Darren Sanders, Yoonho Park, and Maciej Brodowicz. Implementation and performance of MPI-IO file access using MPI datatypes. Technical Report UH-CS-96-12, University of Houston, November 1996.

Abstract: In this paper we document our experience implementing MPI-IO file access using MPI datatypes. We present performance results and discuss two significant problems that stem from the flexibility of MPI datatypes. First, MPI datatypes can be used to specify non-contiguous access patterns. Optimizing data transfers for such patterns is difficult. Second, the behavior of MPI datatypes in a heterogeneous environment is not well-defined.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: They devise several file-access strategies for different situations, depending on the particulars of the etypes and filetypes in use: sequential, two-phase I/O, one file access per etype (random access), and one file access per etype element (random access with smaller pieces). They measure the performance of their system with example patterns that trigger each strategy. It would be nice to see a more extensive performance analysis of their implementation, and of their strategies.

sanders:jasync:
Peter Sanders. Asynchronous scheduling of redundant disk arrays. IEEE Transactions on Computers, 52(9):1170-1184, September 2003.

Abstract: Allocation of data to a parallel disk using redundant storage and random placement of blocks can be exploited to achieve low access delays. New algorithms are proposed which improve the previously known shortest queue algorithm by systematically exploiting the fact that scheduling decisions can be deferred until a block access is actually started on a disk. These algorithms are also generalized for coding schemes with low redundancy. Using extensive simulations, practically important quantities are measured which have so far eluded an analytical treatment: the delay distribution when a stream of requests approaches the limit of the system capacity, the system efficiency for parallel disk applications with bounded prefetching buffers, and the combination of both for mixed traffic. A further step toward practice is taken by outlining the system design for alpha, an automatically load-balanced parallel hard-disk array. Additional algorithmic measures are proposed for alpha that allow variable-sized blocks, seek time reduction, fault tolerance, inhomogeneous systems, and flexible prioritization schemes.

Keywords: parallel disks, lazy scheduling, random redundant storage, I/O algorithm, random block placement, bipartite matching, pario-bib

sanders:models:
P. Sanders. Reconciling simplicity and realism in parallel disk models. Parallel Computing, 28(5):705-723, May 2002.

Abstract: For the design and analysis of algorithms that process huge data sets, a machine model is needed that handles parallel disks. There seems to be a dilemma between simple and flexible use of such a model and accurate modeling of details of the hardware. This paper explains how many aspects of this problem can be resolved. The programming model implements one large logical disk allowing concurrent access to arbitrary sets of variable size blocks. This model can be implemented efficiently on multiple independent disks even if zones with different speed, communication bottlenecks and failed disks are allowed. These results not only provide useful algorithmic tools but also imply a theoretical justification for studying external memory algorithms using simple abstract models. The algorithmic approach is random redundant placement of data and optimal scheduling of accesses. The analysis generalizes a previous analysis for simple abstract external memory models in several ways (higher efficiency, variable block sizes, more detailed disk model).

Keywords: parallel I/O, pario-bib

savage:afraid:
Stefan Savage and John Wilkes. AFRAID - a frequently redundant array of independent disks. In Proceedings of the 1996 USENIX Technical Conference, pages 27-39, January 1996.

Keywords: RAID, disk array, parallel I/O, pario-bib

Comment: RAID array that relaxes the consistency requirements, to not write parity during busy periods, then to go back and update parity during idle periods. Thus you sacrifice a little reliability for performance; you can select how much.
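
A rough sketch of the deferred-parity idea, under simplifying assumptions (XOR parity, one stripe, invented names; not the AFRAID implementation):

    # Sketch of AFRAID-style deferred parity.
    class Stripe:
        def __init__(self, data_blocks):
            self.data = data_blocks
            self.parity = self._xor(data_blocks)
            self.parity_stale = False

        @staticmethod
        def _xor(blocks):
            p = 0
            for b in blocks:
                p ^= b
            return p

        def write(self, i, value, busy):
            self.data[i] = value
            if busy:
                self.parity_stale = True  # defer: fast write, brief risk
            else:
                self.parity = self._xor(self.data)

        def idle_scrub(self):
            if self.parity_stale:         # idle period: restore redundancy
                self.parity = self._xor(self.data)
                self.parity_stale = False

    s = Stripe([1, 2, 3])
    s.write(0, 7, busy=True)   # parity left stale during the busy burst
    s.idle_scrub()             # redundancy restored once the array is idle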

scheuermann:partition:
Peter Scheuermann, Gerhard Weikum, and Peter Zabback. Data partitioning and load balancing in parallel disk systems. Technical Report 209, ETH Zurich, January 1994.
See also later version scheuermann:partition2.

Keywords: parallel I/O, disk array, disk striping, load balance, pario-bib

Comment: Updated as scheuermann:partition2. They describe a file system that attempts to choose both the degree of declustering and the striping unit size to accommodate the needs of different files. They also describe static and dynamic placement and migration policies to readjust the load across disks. Note that there are several references in the bib that are about their file system, called FIVE. Seems to be the same as scheuermann:tunable.

scheuermann:partition2:
Peter Scheuermann, Gerhard Weikum, and Peter Zabback. Data partitioning and load balancing in parallel disk systems. The VLDB Journal, 7(1):48-66, February 1998.
See also earlier version scheuermann:partition.

Abstract: Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper, we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent, self-reliant file system that aims to optimize striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but effective heuristics that incur only little overhead. We present performance experiments based on synthetic workloads and real-life traces.

Keywords: parallel I/O, disk array, disk striping, load balance, pario-bib

Comment: Updated version of scheuermann:partition.

scheuermann:tunable:
Peter Scheuermann, Gerhard Weikum, and Peter Zabback. The case for tunable disk arrays. Publication status unknown, 1993.

Keywords: parallel I/O, disk array, disk striping, pario-bib

Comment: Seems to be the same as scheuermann:partition.

schikuta:bookchap:
Erich Schikuta and Heinz Stockinger. Parallel I/O for clusters: Methodologies and systems. In Rajkumar Buyya, editor, High Performance Cluster Computing, pages 439-462. Prentice Hall PTR, 1999.

Keywords: parallel file system, cluster computing, parallel I/O, pario-bib

schikuta:pario:
Erich Schikuta and Helmut Wanek. Parallel I/O. International Journal of High Performance Computing Applications, 15(2):162-168, Summer 2001.

Keywords: parallel I/O, pario-bib

Comment: A brief overview of issues in parallel I/O, and a short case study of the data-intensive computational grid at CERN.

schloss:hcsa:
Gary Schloss and Michael Vernick. HCSA: a hybrid client-server architecture. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 63-77, April 1995.
See also later version schloss:hcsa-book.

Keywords: parallel I/O, pario-bib

Comment: In the context of client-server database systems, they propose to make a compromise between shared-disk architectures, where the disks are all attached to the network and all machines are both clients and servers, and a system where the disks are attached to a single server. Their compromise attaches the disks to both the network and the server.

schloss:hcsa-book:
Gerhard A. Schloss and Michael Vernick. HCSA: A hybrid client-server architecture. In Jain et al. [iopads-book], chapter 15, pages 333-351.
See also earlier version schloss:hcsa.

Abstract: The HCSA (Hybrid Client-Server Architecture), a flexible system layout that combines the advantages of the traditional Client-Server Architecture (CSA) with those of the Shared Disk Architecture (SDA), is introduced. In HCSA, the traditional CSA-style I/O subsystem is modified to give the clients network access to both the server and the server's set of disks. Hence, the HCSA is more fault-tolerant than the CSA since there are two paths between any client and the shared data. Moreover, a simulation study demonstrates that the HCSA is able to support a larger number of clients than the CSA or SDA under similar system workloads. Finally, the HCSA can run applications in either a CSA mode, an SDA mode, or a combination of the two, thus offering backward compatibility with a large number of existing applications.

Keywords: parallel I/O architecture, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

schneider:sp2-io:
David Schneider. Application I/O and related issues on the SP2, 1995. Available at http://www.tc.cornell.edu/SmartNodes/Newsletters/1994/V6N5/application.html.

Keywords: parallel I/O, IBM SP-2, pario-bib

schulz:semantic:
Martin Schulz and Daniel A. Reed. Using semantic information to guide efficient I/O on clusters. In Proceedings of the Eleventh IEEE International Symposium on High Performance Distributed Computing, pages 135-142, Edinburgh, Scotland, 2002. IEEE Computer Society Press.

Keywords: I/O, data distribution, medical imaging application, parallel I/O, pario-bib

Comment: The paper describes DIOM (Distributed I/O Management), a system to manage data distributed to local disks of a cluster of workstations. The distribution process uses semantic information from both the data set and the application to decide how to distribute the data. The data is stored using a self-describing format (similar to HDF). The description of the data is either stored in a file header, or it is part of a central repository (format identified by file suffix). DIOM decides how to distribute the data based on the application-supplied splitting pattern, of which there are three types: single (copy all data to a single node), block (divide data evenly between the nodes), and round (stripe blocks in a round-robin fashion). Parameters such as stripe size, initial node, etc., are defined by the application.
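
The three splitting patterns are easy to pin down precisely; here is a hypothetical sketch (invented names, not DIOM's API):

    # Hypothetical versions of DIOM's three splitting patterns.
    def split_single(blocks, nodes):
        """'single': copy all data to a single node."""
        return {nodes[0]: list(blocks)}

    def split_block(blocks, nodes):
        """'block': divide the data evenly into contiguous pieces."""
        per = -(-len(blocks) // len(nodes))          # ceiling division
        return {nd: blocks[i * per:(i + 1) * per]
                for i, nd in enumerate(nodes)}

    def split_round(blocks, nodes, stripe=1, start=0):
        """'round': stripe blocks round-robin; stripe size and initial
        node are the application-supplied parameters."""
        out = {nd: [] for nd in nodes}
        for i in range(0, len(blocks), stripe):
            nd = nodes[(start + i // stripe) % len(nodes)]
            out[nd].extend(blocks[i:i + stripe])
        return out

    blocks = list(range(8))
    print(split_block(blocks, ["n0", "n1"]))
    print(split_round(blocks, ["n0", "n1"], stripe=2))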

schulze:raid:
Martin Schulze. Considerations in the design of a RAID prototype. Technical Report UCB/CSD 88/448, UC Berkeley, August 1988.

Keywords: parallel I/O, RAID, disk array, disk architecture, pario-bib

Comment: Very practical description of the RAID I prototype.

schulze:raid2:
Martin Schulze, Garth Gibson, Randy Katz, and David Patterson. How reliable is a RAID? In Proceedings of IEEE Compcon, Spring 1989.
See also earlier version chen:raid.

Keywords: parallel I/O, reliability, RAID, disk array, disk architecture, pario-bib

Comment: Published version of second paper in chen:raid. Some overlap with schulze:raid, though that paper has more detail.

schwabe:flexible:
Eric J. Schwabe and Ian M. Sutherland. Flexible use of parity storage space in disk arrays. In Proceedings of the Eighth Symposium on Parallel Algorithms and Architectures, pages 99-108, Padua, Italy, June 1996. ACM Press.

Keywords: parallel disks, disk array, parity, RAID, pario-bib

schwabe:jlayouts:
Eric J. Schwabe, Ian M. Sutherland, and Bruce K. Holmer. Evaluating approximately balanced parity-declustered data layouts for disk arrays. Parallel Computing, 23(4):501-523, June 1997.
See also earlier version schwabe:layouts.

Abstract: Parity-declustered data layouts were developed to reduce the time for on-line failure recovery in disk arrays. They generally require perfect balancing of reconstruction workload among the disks; this restrictive balance condition makes such data layouts difficult to construct. In this paper, we consider approximately balanced data layouts, where some variation in the reconstruction workload over the disks is permitted. Such layouts are considerably easier to construct than perfectly balanced layouts. We consider three methods for constructing approximately balanced data layouts, and analyze their performance both theoretically and experimentally. We conclude that on uniform workloads, approximately balanced layouts have performance nearly identical to that of perfectly balanced layouts.

Keywords: disk array, parity, RAID, parallel I/O, pario-bib

schwabe:layouts:
Eric J. Schwabe, Ian M. Sutherland, and Bruce K. Holmer. Evaluating approximately balanced parity-declustered data layouts for disk arrays. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 41-54, Philadelphia, May 1996. ACM Press.
See also later version schwabe:jlayouts.

Abstract: Parity declustering has been used to reduce the time required to reconstruct a failed disk in a disk array. Most existing work on parity declustering uses BIBD-based data layouts, which distribute the workload of reconstructing a failed disk over the remaining disks of the array with perfect balance. For certain array sizes, however, there is no known BIBD-based layout. In this paper, we evaluate data layouts that are approximately balanced - that is, that distribute the reconstruction workload over the disks of the array with only approximate balance. Approximately balanced layouts are considerably easier to construct than perfectly balanced layouts. We consider three methods for generating approximately balanced layouts: randomization, simulated annealing, and perturbing a BIBD-based layout whose size is near the desired size. We compare the performance of these approximately balanced layouts with that of perfectly balanced layouts using a disk array simulator. We conclude that, on uniform workloads, approximately balanced data layouts have performance nearly identical to that of perfectly balanced layouts. Approximately balanced layouts therefore provide the reconstruction performance benefits of perfectly balanced layouts for arrays where perfectly balanced layouts are either not known, or do not exist.

Keywords: parallel I/O, disk array, parity, RAID, pario-bib

scott:matrix:
David S. Scott. Parallel I/O and solving out of core systems of linear equations. In Proceedings of the 1993 DAGS/PC Symposium, pages 123-130, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.

Abstract: Large systems of linear equations arise in a number of scientific and engineering applications. In this paper we describe the implementation of a family of disk based linear equation solvers and the required characteristics of the I/O system needed to support them.

Keywords: parallel I/O, scientific computing, matrix factorization, Intel, pario-bib

Comment: Invited speaker. See also scott:solvers. This gives a very brief overview of Intel's block solver and slab solver, both out-of-core linear-systems solvers. He notes a few optimizations that had to be made to CFS to make it work: data and metadata needed to have equal priority in the cache, because often the (higher-priority) metadata was crowding out the data; and they had to restrict some files to small subsets of disks to reduce the contention for the cache at each I/O node caused by large groups of processors all requesting at the same time (see nitzberg:cfs for the same problem).

scott:solvers:
David S. Scott. Out of core dense solvers on Intel parallel supercomputers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 484-487, 1992.

Keywords: parallel I/O, scientific computing, Intel, pario-bib

Comment: He discusses ProSolver-DES, which factors large matrices by swapping square submatrices in and out of memory, and Intel's new solver, which swaps column blocks in and out. The new solver is a little slower, but allows full pivoting, which is needed for stability in some matrices. A short paper with little detail. Some performance numbers. See scott:matrix.

seamons:compressed:
K. E. Seamons and M. Winslett. A data management approach for handling large compressed arrays in high performance computing. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 119-128, February 1995.

Keywords: parallel I/O, pario-bib

Comment: ``This paper shows how compression can be used to speed up parallel i/o of large arrays. The current version of the paper focuses on improving write performance.'' They use chunked files as in seamons:interface, but before writing they compress each chunk on its compute node, and after reading they decompress each chunk on its compute node. Presumably this is only useful when you plan to read back whole chunks. They find better performance for compressing in many cases, even when the compression time dominates the I/O time, because it reduces the I/O time so much. They found that the compression time and compression ratio can vary widely from chunk to chunk, leading to a tremendous load imbalance that unfortunately spoils some of the advantages if all compute nodes must wait for the slowest to finish.
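
A minimal sketch of per-chunk compression on the write path, assuming zlib as a stand-in for whatever compressor they used:

    import zlib

    # Each compute node compresses its own chunk before the write; the
    # per-chunk compressed sizes must be recorded so chunks can be
    # located again on read.
    def compress_chunks(chunks):
        return [zlib.compress(c) for c in chunks]

    def read_chunk(compressed, i):
        return zlib.decompress(compressed[i])    # whole-chunk reads only

    chunks = [bytes(1000), b"abc" * 500]         # compressibility varies...
    recs = compress_chunks(chunks)
    print([len(r) for r in recs])                # ...so sizes vary per chunk,
                                                 # the load-imbalance problem
    assert read_chunk(recs, 1) == b"abc" * 500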

seamons:interface:
K. E. Seamons and M. Winslett. An efficient abstract interface for multidimensional array I/O. In Proceedings of Supercomputing '94, pages 650-659, Washington, DC, November 1994. IEEE Computer Society Press.
See also later version seamons:jpanda.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: ``This paper shows what large performance gains can be made for parallel i/o of large arrays by using a carefully implemented library interface for i/o that makes use of array chunking. For example, the authors obtained a factor of 10 speedup in output of time step data by using the natural array chunks of the problem decomposition as the units of i/o on an Intel iPSC/860. The paper also presents results from experiments with the use of chunking in checkpointing and restarts on parallel architectures, and the use of chunking with memory-mapped data files in visualization on sequential architectures.'' They describe a library that supports chunked representations of matrices. That is, ways to checkpoint, output, or input multidimensional matrices to files in a blocked rather than row-major or column-major layout. This helps the file be more versatile for reading in a variety of dimensions. Their experiments show good performance improvements, although they only tried it for an application whose data set in memory was already in a blocked distribution - I would guess that smaller improvements might come from column- or row-oriented memory distributions. Also, some of their performance improvement came from characteristics specific to the Intel CFS file system, having to do with its IOP-cache management policies. See also seamons:schemas and seamons:compressed.
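
To make the blocked layout concrete, a small sketch (an illustration, not Panda code) of where an element of a 2-D array lands when the file stores fixed-size square chunks in row-major chunk order:

    # File offset of element (i, j) of an array with C columns, stored
    # as B x B chunks written one after another.
    def chunked_offset(i, j, C, B, elem_size=8):
        chunks_per_row = -(-C // B)      # ceiling division
        ci, cj = i // B, j // B          # which chunk holds (i, j)
        li, lj = i % B, j % B            # position inside that chunk
        chunk_index = ci * chunks_per_row + cj
        return (chunk_index * B * B + li * B + lj) * elem_size

    # In an 8x8 array with 4x4 chunks, element (0, 4) begins the second
    # chunk: 16 elements * 8 bytes = offset 128.
    print(chunked_offset(0, 4, C=8, B=4))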

seamons:jpanda:
Kent E. Seamons and Marianne Winslett. Multidimensional array I/O in Panda 1.0. Journal of Supercomputing, 10(2):191-211, 1996.
See also earlier version seamons:interface.

Keywords: parallel I/O, collective I/O, pario-bib

seamons:msio:
K. E. Seamons, Y. Chen, M. Winslett, Y. Cho, S. Kuo, P. Jones, J. Jozwiak, and M. Subramanian. Fast and easy I/O for arrays in large-scale applications, October 1995. At SPDP'95.

Abstract: This four-page paper, written for an audience from the supercomputing/parallel i/o community, is a nice succinct introduction to Panda. Abstract and summary:

Scientists with high-performance computing needs are plagued by applications suffering poor i/o performance and are burdened with the need to consider low-level physical storage details of persistent arrays in order to reach acceptable i/o performance levels, especially with existing parallel i/o facilities. The Panda i/o library (URL http://bunny.cs.uiuc.edu/CADR/panda.html) serves as a concrete example of a methodology for freeing application developers from unnecessary storage details through high-level abstract interfaces and providing them with increased performance and greater portability.

Panda addresses these problems by introducing high-level application program interfaces for array i/o on both parallel and sequential machines, and by developing an efficient commodity-parts-based implementation of those interfaces across a variety of computer architectures. It is costly to build a file system from scratch and we designed Panda to run on top of existing commodity file systems such as AIX; excellent performance using this approach implies immediate and broad applicability. High-level interfaces provide ease of use, application portability, and, most importantly, allow plenty of flexibility for an efficient underlying implementation. A high-level view of an entire i/o operation, made possible with Panda's high level interfaces, allows Panda to optimize reading and writing arrays to the host file system on the i/o nodes using Panda's server-directed i/o architecture.

Panda focuses specifically on multidimensional arrays, the data type at the root of i/o performance problems in scientific computing. The Panda i/o library exhibits excellent performance on the NASA Ames NAS IBM SP2, attaining 83-98% of peak AIX performance on each i/o node in the experiments described in this paper. We expect high-level interfaces such as Panda's to become the interfaces of choice for scientific applications in the future. As Panda can be easily added on top of existing parallel file systems and ordinary file systems without changing them, Panda illustrates a way to obtain cheap, fast, and easy-to-use i/o for high-performance scientific applications.

Keywords: parallel I/O, scientific computing, pario-bib

Comment: Just a short 4-page summary of the Panda I/O library, including some brief performance results.

seamons:panda:
K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. In Proceedings of Supercomputing '95, San Diego, CA, December 1995. IEEE Computer Society Press.

Abstract: We present the architecture and implementation results for Panda 2.0, a library for input and output of multidimensional arrays on parallel and sequential platforms. Panda achieves remarkable performance levels on the IBM SP2, showing excellent scalability as data size increases and as the number of nodes increases, and provides throughputs close to the full capacity of the AIX file system on the SP2 we used. We argue that this good performance can be traced to Panda's use of server-directed i/o (a logical-level version of disk-directed i/o [Kotz94b]) to perform array i/o using sequential disk reads and writes, a very high level interface for collective i/o requests, and built-in facilities for arbitrary rearrangements of arrays during i/o. Other advantages of Panda's approach are ease of use, easy application portability, and a reliance on commodity system software.

Keywords: collective I/O, parallel I/O, pario-bib

Comment: This rewrite of Panda (see seamons:interface) is in C++ and runs on the SP2. They provide simple ways to declare the distribution of your array in memory and on disk, to form a list of arrays to be output at each timestep or at each checkpoint, and then to call for a timestep or checkpoint. Then they use something like disk-directed I/O (kotz:jdiskdir) internally to accomplish the rearrangement and transfer of data from compute nodes to I/O nodes. Note proceedings only on CD-ROM and WWW.

seamons:schemas:
K. E. Seamons and M. Winslett. Physical schemas for large multidimensional arrays in scientific computing applications. In Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management, pages 218-227, September 1994.

Keywords: parallel I/O, scientific database, scientific computing, pario-bib

Comment: ``This paper presents PANDA's high-level interfaces for i/o operations, including checkpoint, restart, and time step output, and explains the rationale behind them.'' Basically they provide a bit of detail for the file formats they use in seamons:interface.

seamons:thesis:
Kent E. Seamons. Panda: Fast Access to Persistent Arrays Using High Level Interfaces and Server Directed Input/Output. PhD thesis, University of Illinois at Urbana-Champaign, May 1996.

Abstract: Multidimensional arrays are a fundamental data type in scientific computing and are used extensively across a broad range of applications. Often these arrays are persistent, i.e., they outlive the invocation of the program that created them. Portability and performance with respect to input and output (i/o) pose significant challenges to applications accessing large persistent arrays, especially in distributed-memory environments. A significant number of scientific applications perform conceptually simple array i/o operations, such as reading or writing a subarray, an entire array, or a list of arrays. However, the algorithms to perform these operations efficiently on a given platform may be complex and non-portable, and may require costly customizations to operating system software.

This thesis presents a high-level interface for array i/o and three implementation architectures, embodied in the Panda (Persistence AND Arrays) array i/o library. The high-level interface contributes to application portability, by encapsulating unnecessary details and being easy to use. Performance results using Panda demonstrate that an i/o system can provide application programs with a high-level, portable, easy-to-use interface for array i/o without sacrificing performance or requiring custom system software; in fact, combining all these benefits may only be possible through a high-level interface due to the great freedom and flexibility a high-level interface provides for the underlying implementation.

The Panda server-directed i/o architecture is a prime example of an efficient implementation of collective array i/o for closely synchronized applications in distributed-memory single-program multiple-data (SPMD) environments. A high-level interface is instrumental to the good performance of server-directed i/o, since it provides a global view of an upcoming collective i/o operation that Panda uses to plan sequential reads and writes. Performance results show that with server-directed i/o, Panda achieves throughputs close to the maximum AIX file system throughput on the i/o nodes of the IBM SP2 when reading and writing large multidimensional arrays.

Keywords: parallel I/O, persistent data, parallel computing, pario-bib

Comment: see also chen:panda, seamons:panda, seamons:compressed, seamons:interface, seamons:schemas, seamons:msio, seamons:jpanda

segawa:pvfs-pm:
Koji Segawa, Osamu Tatebe, Yuetsu Kodama, Tomohiro Kudoh, and Toshiyuki Shimizu. Design and implementation of PVFS-PM: a cluster file system on SCore. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 705-711, Tokyo, May 2003. National Institute of Advanced Industrial Science and Technology (AIST), IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: This paper discusses the design and implementation of a cluster file system, called PVFS-PM, on the SCore cluster system software. This is the first attempt to implement a cluster file system on the SCore system. It is based on the PVFS cluster file system but replaces TCP with the PMv2 communication library supported by SCore to provide a scalable, high-performance cluster file system. PVFS-PM improves the performance by factors of 1.07 and 1.93 for writing and reading, respectively, with 8 I/O nodes, compared with the original PVFS on TCP on a Gigabit Ethernet-connected SCore cluster.

Keywords: parallel I/O, pario-bib

shah:algorithms:
Rahul Shah, Peter J. Varman, and Jeffrey Scott Vitter. Online algorithms for prefetching and caching on parallel disks. In Proceedings of the Sixteenth Symposium on Parallel Algorithms and Architectures, volume 16, pages 255-264, Barcelona, Spain, June 2004.

Abstract: Parallel disks provide a cost-effective way of speeding up I/Os in applications that work with large amounts of data. The main challenge is to achieve as much parallelism as possible, using prefetching to avoid bottlenecks in disk access. Efficient algorithms have been developed for some particular patterns of accessing the disk blocks. In this paper, we consider general request sequences. When the request sequence consists of unique block requests, the problem is called prefetching and is a well-solved problem for arbitrary request sequences. When the reference sequence can have repeated references to the same block, we need to devise an effective caching policy as well. While optimum offline algorithms have been recently designed for the problem, in the online case, no effective algorithm was previously known. Our main contribution is a deterministic online algorithm threshold-LRU which achieves O((MD/L)^{2/3}) competitive ratio and a randomized online algorithm threshold-MARK which achieves O(sqrt((MD/L) log(MD/L))) competitive ratio for the caching/prefetching problem on the parallel disk model (PDM), where D is the number of disks, M is the size of fast memory buffer, and M + L is the amount of lookahead available in the request sequence. The best-known lower bound on the competitive ratio is Omega(sqrt(MD/L)) for lookahead L >= M in both models. We also show that if the deterministic online algorithm is allowed to have twice the memory of the offline then a tight competitive ratio of Theta(sqrt(MD/L)) can be achieved. This problem generalizes the well-known paging problem on a single disk to the parallel disk model.
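
Because the competitive ratios are easy to misparse in plain text, here is a restatement in LaTeX; the grouping under the square root reflects one reading of the abstract:

    % D = number of disks, M = fast memory size, M + L = lookahead.
    \begin{align*}
      \text{threshold-LRU (deterministic):} &\quad O\bigl((MD/L)^{2/3}\bigr)\\
      \text{threshold-MARK (randomized):} &\quad O\bigl(\sqrt{(MD/L)\log(MD/L)}\bigr)\\
      \text{lower bound (lookahead } L \ge M\text{):} &\quad \Omega\bigl(\sqrt{MD/L}\bigr)\\
      \text{deterministic, twice the offline memory:} &\quad \Theta\bigl(\sqrt{MD/L}\bigr)
    \end{align*}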

Keywords: online algorithms, prefetching, caching, parallel disk model, threshold LRU, pario-bib

shen:data-management:
X. H. Shen, W. K. Liao, A. Choudhary, G. Memik, and M. Kandemir. A high-performance application data environment for large-scale scientific computations. IEEE Transactions on Parallel and Distributed Systems, 14(12):1262-1274, December 2003.

Abstract: Effective high-level data management is becoming an important issue with more and more scientific applications manipulating huge amounts of secondary-storage and tertiary-storage data using parallel processors. A major problem facing the current solutions to this data management problem is that these solutions either require a deep understanding of specific data storage architectures and file layouts to obtain the best performance (as in high-performance storage management systems and parallel file systems), or they sacrifice significant performance in exchange for ease-of-use and portability (as in traditional database management systems). We discuss the design, implementation, and evaluation of a novel application development environment for scientific computations. This environment includes a number of components that make it easy for the programmers to code and run their applications without much programming effort and, at the same time, to harness the available computational and storage power on parallel architectures.

Keywords: data management, scientific applications, workflow, parallel file systems, pario-bib

shen:dpfs:
Xiaohui H. Shen and Alok Choudhary. A high-performance distributed parallel file system for data-intensive computations. Journal of Parallel and Distributed Computing, 64(10):1157-1167, September 2004.

Abstract: One of the challenges brought by large-scale scientific applications is how to avoid remote storage access by collectively using sufficient local storage resources to hold huge amounts of data generated by the simulation while providing high-performance I/O. DPFS, a distributed parallel file system, is designed and implemented to address this problem. DPFS collects locally distributed and unused storage resources as a supplement to the internal storage of parallel computing systems to satisfy the storage capacity requirement of large-scale applications. In addition, like parallel file systems, DPFS provides striping mechanisms that divide a file into small pieces and distributes them across multiple storage devices for parallel data access. The unique feature of DPFS is that it provides three file levels with each file level corresponding to a file striping method. In addition to the traditional linear striping method, DPFS also provides a novel Multidimensional striping method that can solve performance problems of linear striping for many popular access patterns. Other issues such as load-balancing and user interface are also addressed in DPFS.

Keywords: distributed file system, parallel file system, striping, pario-bib

shi:dma-raid:
Zhan Shi, Jiangling Zhang, and Xinrong Zhou. Using DMA aligned buffer to improve software RAID performance. Lecture Notes in Computer Science, 3038:355-362, June 2004.

Abstract: While the storage market grows rapidly, software RAID, as a low-cost solution, becomes more and more important nowadays. However, the performance of software RAID is greatly constrained by its implementation. Various methods have been tried to improve its performance. By integrating a novel buffer mechanism, the DMA aligned buffer (DAB), into the software RAID kernel driver, we achieved a significant performance improvement, especially on small I/O requests.

Keywords: DMA, software RAID, performance, DMA aligned buffer, DAB, pario-bib

shieh:dsm-pario:
Ce-Kuen Shieh, Su-Cheong Mac, and Jyh-Chang Ueng. Improving the performance of distributed shared memory systems via parallel file input/output. Journal of Systems and Software, 44(1):3-15, December 1998.

Keywords: distributed shared memory, parallel I/O, file I/O, file system, virtual memory, pario-bib

Comment: A parallel-I/O scheme for a system using DSM, which has one disk per node. The file is initially placed on node 0. The application runs once, and the system collects information about the access pattern. The file is then redistributed across all disks. The application must do all file accesses from node 0, but in subsequent runs this causes each block to be read from its disk into the local memory of the attached node, and VM-mapped into the correct place. Later page faults will move the data to the node needing the data first (if the redistribution is done well, that's the same node, so no movement is needed). At the end of the program, output data are written to the output file on the local disk. Thus: input files go to node 0 on the first run, then are redistributed before the second run, and output files are created across all nodes but are written only at file close and only to the closest disk. Limitations: files must be wholly read during application initialization, from node 0; files must be wholly written out during application completion; files are immutable; you must have one slow run initially; and input files must fit on one disk. I read sections 1-2, then skimmed the rest.
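
A toy version of the redistribution step as described above (the policy details are invented; the real system works at VM-page granularity):

    from collections import Counter, defaultdict

    # After the profiling run, place each file block on the local disk of
    # the node that accessed it most often, so later page faults find the
    # data already local.
    def redistribute(trace):
        """trace: list of (node, block) access events from the first run."""
        counts = defaultdict(Counter)
        for node, block in trace:
            counts[block][node] += 1
        return {blk: c.most_common(1)[0][0] for blk, c in counts.items()}

    trace = [("n0", 0), ("n0", 0), ("n1", 0), ("n1", 1), ("n1", 1)]
    print(redistribute(trace))   # block 0 -> n0's disk, block 1 -> n1's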

shin:hartsio:
Kang G. Shin and Greg Dykema. A distributed I/O architecture for HARTS. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 332-342, 1990.

Keywords: parallel I/O, multiprocessor architecture, MIMD, fault tolerance, pario-bib

Comment: HARTS is a multicomputer connected with a wrapped hexagonal mesh, with an emphasis on real-time and fault tolerance. The mesh consists of network routing chips. Hanging off each is a small bus-based multiprocessor ``node''. They consider how to integrate I/O devices into this architecture: attach device controllers to processors, to network routers, to node busses, or via a separate network. They decided to compromise and hang each I/O controller off three network routers, in the triangles of the hexagonal mesh. This keeps the traffic off of the node busses, and allows multiple paths to each controller. They discuss the reachability and hop count in the presence of failed nodes and links.

shirriff:sawmill:
Ken Shirriff and John Ousterhout. Sawmill: A high-bandwidth logging file system. In Proceedings of the 1994 Summer USENIX Technical Conference, pages 125-136, 1994.

Keywords: file system, parallel I/O, pario-bib, RAID

Comment: This is a file system based on LFS and run on the RAID-II prototype (see drapeau:raid-ii). It uses the RAID-II controller's memory (32 MB) to pipeline data transfers from the RAID disks directly to (from) the network. Thus, data never flows through the server CPU or memory. The server remains in control, telling the controller where each block goes, etc. They get very high data rates. And despite being much faster than the RAID for small writes, they were still CPU-limited, because the CPU had to handle all the little requests.

shock:database:
Carter T. Shock, Chialin Chang, Bongki Moon, Anurag Acharya, Larry Davis, Joel Saltz, and Alan Sussman. The design and evaluation of a high-performance earth science database. Parallel Computing, 24(1):65-89, January 1998.

Keywords: parallel I/O, database, pario-bib

Comment: Part of a special issue.

shriver:api-tr:
Elizabeth A. M. Shriver and Leonard F. Wisniewski. An API for choreographing data accesses. Technical Report PCS-TR95-267, Dept. of Computer Science, Dartmouth College, November 1995.

Abstract: Current APIs for multiprocessor multi-disk file systems are not easy to use in developing out-of-core algorithms that choreograph parallel data accesses. Consequently, the efficiency of these algorithms is hard to achieve in practice. We address this deficiency by specifying an API that includes data-access primitives for data choreography. With our API, the programmer can easily access specific blocks from each disk in a single operation, thereby fully utilizing the parallelism of the underlying storage system. Our API supports the development of libraries of commonly-used higher-level routines such as matrix-matrix addition, matrix-matrix multiplication, and BMMC (bit-matrix-multiply/complement) permutations. We illustrate our API in implementations of these three high-level routines to demonstrate how easy it is to use.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: Also published as Courant Institute Tech Report 708.

shriver:models-algs:
Elizabeth Shriver and Mark Nodine. An introduction to parallel I/O models and algorithms. In Jain et al. [iopads-book], chapter 2, pages 31-68.

Abstract: Problems whose data are too large to fit into main memory are called out-of-core problems. Out-of-core parallel-I/O algorithms can handle much larger problems than in-memory variants and have much better performance than single-device variants. However, they are not commonly used, partly because the understanding of them is not widespread. Yet such algorithms ought to be growing in importance because they address the needs of users with ever-growing problem sizes and ever-increasing performance needs.

This paper addresses this lack of understanding by presenting an introduction to the data-transfer models on which most of the out-of-core parallel-I/O algorithms are based, with particular emphasis on the Parallel Disk Model. Sample algorithms are discussed to demonstrate the paradigms (algorithmic techniques) used with these models.

Our aim is to provide insight into both the paradigms and the particular algorithms described, thereby also providing a background for understanding a range of related solutions. It is hoped that this background would enable the appropriate selection of existing algorithms and the development of new ones for current and future out-of-core problems.

Keywords: parallel I/O algorithms, out-of-core, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

si-woong:cluster:
Jang Si-Woong, Chung Ki-Dong, and Sam Coleman. Design and implementation of a network-wide concurrent file system in a workstation cluster. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 239-245. IEEE Computer Society Press, September 1995.

Abstract: We estimate the performance of a network-wide concurrent file system implemented using conventional disks as disk arrays. Tests were carried out on both single system and network-wide environments. On single systems, a file was split across several disks to test the performance of file I/O operations. We concluded that performance was proportional to the number of disks, up to four, on a system with high computing power. Performance of a system with low computing power, however, did not increase, even with more than two disks. When we split a file across disks in a network-wide system called the Network-wide Concurrent File System (N-CFS), we found performance similar to or slightly higher than that of disk arrays on single systems. Since file access through N-CFS is transparent, this system enables traditional disks on single and networked systems to be used as disk arrays for I/O intensive jobs.

Keywords: mass storage, cluster computing, distributed file system, parallel I/O, pario-bib

sicola:storageworks:
Stephen J. Sicola. The architecture and design of HS-series StorageWorks array controllers. Digital Technical Journal, 6(4):5-25, Fall 1994.

Keywords: disk controller, RAID, parallel I/O, pario-bib

Comment: Describes the RAID controller for the DEC StorageWorks product.

simitci:patterns:
Huseyin Simitci and Daniel Reed. A comparison of logical and physical parallel I/O patterns. The International Journal of High Performance Computing Applications, 12(3):364-380, Fall 1998.

Abstract: Although there are several extant studies of parallel scientific application request patterns, there is little experimental data on the correlation of physical input/output patterns with application input/output stimuli. To understand these correlations, we have instrumented the SCSI device drivers of the Intel Paragon OSF/1 operating system to record key physical input/output activities and have correlated this data with the input/output patterns of scientific applications captured via the Pablo analysis toolkit. Our analysis shows that disk hardware features profoundly affect the distribution of request delays and that current parallel file systems respond to parallel application input/output patterns in non-scalable ways.

Keywords: parallel I/O application, pario-bib

Comment: In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.

simitci:striping:
Huseyin Simitci and Daniel A. Reed. Adaptive disk striping for parallel input/output. In Proceedings of the Seventh NASA Goddard Conference on Mass Storage Systems and Technologies, pages 88-102, San Diego, CA, March 1999. IEEE Computer Society Press.

Keywords: adaptive striping, disk striping, parallel I/O, pario-bib

sinclair:instability:
James B. Sinclair, Jay Tang, and Peter J. Varman. Instability in parallel I/O systems. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 16-35. Rice University, April 1994. Also appeared in Computer Architecture News 22(4).
See also later version sinclair:instability-book.

Keywords: parallel I/O, pario-bib

Comment: They study the performance of a parallel I/O system when several concurrent processes are accessing a shared set of disks, using a common buffer pool. They found that under certain circumstances the system can become unstable, in that some subset of processes monopolize all of the resources, bringing the others to a virtual halt. They use analytical models to show that instability can occur if every process has distinct input and output disks, reads are faster than writes, the disk scheduling policy falls into a certain class, and processes do not wait for other resources.

sinclair:instability-book:
J. B. Sinclair, J. Tang, and P. J. Varman. Placement-related problems in shared disk I/O. In Jain et al. [iopads-book], chapter 12, pages 271-289.
See also earlier version sinclair:instability.

Abstract: In a shared-disk parallel I/O system, several processes may be accessing the disks concurrently. An important example is concurrent external merging arising in database management systems with multiple independent sort queries. Such a system may exhibit instability, with one of the processes racing ahead of the others and monopolizing I/O resources. This race can lead to serialization of the processes and poor disk utilization, even when the static load on the disks is balanced. The phenomenon can be avoided by proper layout of data on the disks, as well as through other I/O management strategies. This has implications for both data placement in multiple disk systems and task partitioning for parallel processing.

Keywords: parallel I/O, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

sinclair:placement:
J. B. Sinclair, J. Tang, P. J. Varman, and B. R. Iyer. Impact of data placement on parallel I/O systems. In Proceedings of the 1993 International Conference on Parallel Processing, pages III:276-279, St. Charles, IL, 1993. CRC Press.

Keywords: parallel I/O, pario-bib

Comment: Several external merges (many sorted runs into one) are concurrently in action. Where do you put their input and output runs, that is, on which disks? Only input runs are striped, and usually on a subset of disks.

singh:adopt:
Tarvinder Pal Singh and Alok Choudhary. ADOPT: A dynamic scheme for optimal prefetching in parallel file systems. Technical report, NPAC, June 1994.

Keywords: parallel I/O, pario-bib

Comment: They describe a prefetching scheme where hints can be provided from the programmer, compiler, or runtime library to the I/O node. These hints seem to take the form of a sequence (all in order) or a set (only one of many, from conditional expressions). The hints come from each process, not collectively. Then, the I/O node keeps these specifications and uses them to drive prefetching when there is no other work to do. They rotate among the specifications of many processes. Later they hope to examine more complex scheduling strategies and buffer-space allocation strategies.
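
A toy model of the scheme (all structures invented for illustration): per-process hint queues, served round-robin whenever the I/O node is otherwise idle.

    from collections import deque

    class Prefetcher:
        def __init__(self):
            self.order = deque()     # round-robin order of processes
            self.queues = {}         # process id -> deque of hinted blocks
            self.cache = set()

        def register_hints(self, pid, blocks):
            self.queues[pid] = deque(blocks)
            self.order.append(pid)

        def idle_step(self):
            """Issue one prefetch from the next process with pending hints."""
            for _ in range(len(self.order)):
                pid = self.order[0]
                self.order.rotate(-1)
                if self.queues[pid]:
                    self.cache.add(self.queues[pid].popleft())
                    return

    p = Prefetcher()
    p.register_hints(1, [10, 11, 12])   # a 'sequence' hint from process 1
    p.register_hints(2, [50, 51])       # and one from process 2
    for _ in range(3):
        p.idle_step()
    print(sorted(p.cache))              # [10, 11, 50]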

sivathanu:dgraid:
Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Improving storage system availability with D-GRAID. In Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies, pages 15-30, San Francisco, CA, March 2004. University of Wisconsin, Madison, USENIX Association.

Abstract: We present the design, implementation, and evaluation of D-GRAID, a gracefully-degrading and quickly-recovering RAID storage array. D-GRAID ensures that most files within the file system remain available even when an unexpectedly high number of faults occur. D-GRAID also recovers from failures quickly, restoring only live file system data to a hot spare. Both graceful degradation and live-block recovery are implemented in a prototype SCSI-based storage system underneath unmodified file systems, demonstrating that powerful "file-system like" functionality can be implemented behind a narrow block-based interface.

Keywords: fault tolerance, disk failure, RAID, D-GRAID, pario-bib

Comment: Awarded best student paper.

smirni:bevolutionary:
Evgenia Smirni, Ruth A. Aydt, Andrew A. Chien, and Daniel A. Reed. I/O requirements of scientific applications: An evolutionary view. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 40, pages 576-594. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version smirni:evolutionary.

Keywords: I/O, workload characterization, scientific computing, parallel I/O, pario-bib

Comment: Part of jin:io-book, modified from smirni:evolutionary.

smirni:evolutionary:
Evgenia Smirni, Ruth A. Aydt, Andrew A. Chien, and Daniel A. Reed. I/O requirements of scientific applications: An evolutionary view. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 49-59, Syracuse, NY, 1996. IEEE Computer Society Press.
See also later version smirni:bevolutionary.

Abstract: The modest I/O configurations and file system limitations of many current high-performance systems preclude solution of problems with large I/O needs. I/O hardware and file system parallelism is the key to achieving high performance. We analyze the I/O behavior of several versions of two scientific applications on the Intel Paragon XP/S. The versions involve incremental application code enhancements across multiple releases of the operating system. Studying the evolution of I/O access patterns underscores the interplay between application access patterns and file system features. Our results show that both small and large request sizes are common, that at present, application developers must manually aggregate small requests to obtain high disk transfer rates, that concurrent file accesses are frequent, and that appropriate matching of the application access pattern and the file system access mode can significantly increase application I/O performance. Based on these results, we describe a set of file system design principles.

Keywords: I/O, workload characterization, scientific computing, parallel I/O, pario-bib

Comment: They study two applications over several versions, using Pablo to capture the I/O activity. They thus watch as application developers improve the applications' use of I/O modes and request sizes. Both applications move through three phases: initialization, computation (with out-of-core I/O or checkpointing I/O), and output. They found it necessary to tune the I/O request sizes to match the parameters of the I/O system. In the initial versions, the code used small read and write requests, which were (according to the developers) the "easiest and most natural implementation for their I/O." They restructured the I/O to make bigger requests, which better matched the capabilities of Intel PFS. They conclude that asynchronous and collective operations are imperative. They would like to see a file system that can adapt dynamically to adjust its policies to the apparent access patterns. Automatic request aggregation of some kind seems like a good idea; of course, that is one feature of a buffer cache.

smirni:lessons:
E. Smirni and D.A. Reed. Lessons from characterizing the input/output behavior of parallel scientific applications. Performance Evaluation: An International Journal, 33(1):27-44, June 1998.
See also earlier version smirni:workload.

Abstract: As both processor and interprocessor communication hardware is evolving rapidly with only moderate improvements to file system performance in parallel systems, it is becoming increasingly difficult to provide sufficient input/output (I/O) performance to parallel applications. I/O hardware and file system parallelism are the key to bridging this performance gap. Prerequisite to the development of efficient parallel file systems is the detailed characterization of the I/O demands of parallel applications. In the paper, we present a comparative study of parallel I/O access patterns, commonly found in I/O intensive scientific applications. The Pablo performance analysis tool and its I/O extensions is a valuable resource in capturing and analyzing the I/O access attributes and their interactions with extant parallel I/O systems. This analysis is instrumental in guiding the development of new application programming interfaces (APIs) for parallel file systems and effective file system policies that respond to complex application I/O requirements.

Keywords: workload characterization, parallel I/O, scientific applications, pario-bib

Comment: This paper compares the I/O performance of five scientific applications from the scalable I/O initiative (SIO) suite of applications. Their goals are to collect detailed performance data on application characteristics and access patterns and to use that information to design and evaluate parallel file system policies and APIs. The related work section gives a nice overview of recent I/O characterization studies. They use the Pablo performance analysis environment (reed:pablo) to analyze the performance of their five applications. The applications they chose to evaluate include: MESSKIT and NWChem, two implementations of the Hartree-Fock method for computational chemistry applications; QCRD, a quantum chemical reaction dynamics application; PRISM, a parallel 3D numerical simulation of the Navier-Stokes equations that models high-speed turbulent flow that is periodic in one direction; ECAT, a parallel implementation of the Schwinger multichannel method used to calculate low-energy electron-molecule collisions.

The results showed that applications use a combination of sequential and interleaved access patterns, demonstrating a clear need for a richer API than the standard UNIX interface provides. In addition, when applications required concurrent accesses, they commonly channeled all I/O requests through a single node. Some form of collective I/O would have helped in these cases. They also observed that, despite the existence of several parallel I/O APIs, programmers of scientific applications preferred to use standard UNIX I/O, mostly due to the lack of an established portable standard. Their study was "instrumental in the design and implementation of MPI-IO".

Their section on emerging I/O APIs is particularly interesting. They comment that "the diversity of I/O request sizes and patterns suggests that achieving high performance is unlikely with a single file system policy." Their solution is a file system in which the user can give "hints" expressing expected access patterns, or one that classifies access patterns automatically; the file system can then choose policies suited to those patterns.
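
The hint idea they describe later surfaced in MPI-IO, where applications pass access-pattern hints through an MPI_Info object at open time. A small sketch using two of the standard reserved hint keys; the file name and pattern values are only illustrative.

#include <mpi.h>

MPI_File open_with_hints(const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    /* Standard reserved I/O hints: describe the expected access pattern
     * and ask for collective buffering. */
    MPI_Info_set(info, "access_style", "read_mostly,sequential");
    MPI_Info_set(info, "collective_buffering", "true");

    MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}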

smirni:workload:
E. Smirni and D.A. Reed. Workload characterization of input/output intensive parallel applications. In Proceedings of the Conference on Modelling Techniques and Tools for Computer Performance Evaluation, volume 1245 of Lecture Notes in Computer Science, pages 169-180. Springer-Verlag, June 1997.
See also later version smirni:lessons.

Abstract: The broadening disparity in the performance of input/output (I/O) devices and the performance of processors and communication links on parallel systems is a major obstacle to achieving high performance for a wide range of parallel applications. I/O hardware and file system parallelism are the keys to bridging this performance gap. A prerequisite to the development of efficient parallel file systems is detailed characterization of the I/O demands of parallel applications. In this paper, we present a comparative study of the I/O access patterns commonly found in I/O intensive parallel applications. Using the Pablo performance analysis environment and its I/O extensions we captured application I/O access patterns and analyzed their interactions with current parallel I/O systems. This analysis has proven instrumental in guiding the development of new application programming interfaces (APIs) for parallel file systems and in developing effective file system policies that can adaptively respond to complex application I/O requirements.

Keywords: parallel I/O, pario-bib

Comment: see smirni:lessons

smotherman:taxonomy:
Mark Smotherman. A sequencing-based taxonomy of I/O systems and review of historical machines. Computer Architecture News, 17(5):5-15, September 1989.

Keywords: I/O architecture, historical summary, pario-bib

Comment: Classifies I/O systems by how they initiate and terminate I/O, covering both uniprocessor and multiprocessor systems.

snir:hpfio:
Marc Snir. Proposal for IO. Posted to HPFF I/O Forum, August 31, 1992. Second Draft.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

Comment: An outline of two possible ways to specify mappings of arrays to storage nodes in a multiprocessor, and to make unformatted parallel transfers of multiple records. Seems to apply only to arrays, and to files that hold only arrays. It keeps the linear structure of files as sequences of records, but in some cases does not preserve the order of data items or of fields within subrecords. Tricky to understand unless you know HPF and Fortran 90.
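
As a rough illustration of what mapping an array to storage nodes involves, here is a toy BLOCK mapping from a 1-D array index to a (node, offset) pair; the proposal's actual notation is HPF-like and far richer than this.

/* BLOCK-distribute n elements over p storage nodes: element i lives
 * at node i/block, offset i mod block.  Illustrative only. */
struct location { int node; long offset; };

static struct location map_block(long i, long n, int p)
{
    long block = (n + p - 1) / p;      /* elements per node, rounded up */
    struct location loc;
    loc.node   = (int)(i / block);
    loc.offset = i % block;
    return loc;
}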

sobti:personalraid:
Sumeet Sobti, Nitin Garg, Chi Zhang, Xiang Yu, Arvind Krishnamurthy, and Randolph Y. Wang. PersonalRAID: Mobile storage for distributed and disconnected computers. In Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies, pages 159-174, Monterey, CA, January 2002. USENIX Association.

Abstract: This paper presents the design and implementation of a mobile storage system called a PersonalRAID. PersonalRAID manages a number of disconnected storage devices. At the heart of a PersonalRAID system is a mobile storage device that transparently propagates data to ensure eventual consistency. Using this mobile device, a PersonalRAID provides the abstraction of a single coherent storage name space that is available everywhere, and it ensures reliability by maintaining data redundancy on a number of storage devices. One central aspect of the PersonalRAID design is that the entire storage system consists solely of a collection of storage logs; the log-structured design not only provides an efficient means for update propagation, but also allows efficient direct I/O accesses to the logs without incurring unnecessary log replay delays. The PersonalRAID prototype demonstrates that the system provides the desired transparency and reliability functionalities without imposing any serious performance penalty on a mobile storage user.

Keywords: file systems, pario-bib

soloviev:prefetching:
Valery V. Soloviev. Prefetching in segmented disk cache for multi-disk systems. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 69-82, Philadelphia, May 1996. ACM Press.

Abstract: This paper investigates the performance of a multi-disk storage system equipped with a segmented disk cache processing a workload of multiple relational scans. Prefetching is a popular method of improving the performance of scans. Many modern disks have a multisegment cache which can be used for prefetching. We observe that, exploiting declustering as a data placement method, prefetching in a segmented cache causes a load imbalance among several disks. A single disk becomes a bottleneck, degrading performance of the entire system. A variation in disk queue length is a primary factor of the imbalance. Using a precise simulation model, we investigate several approaches to achieving better balancing. Our metrics are a scan response time for the closed-end system and an ability to sustain a workload without saturating for the open-end system. We arrive at two main conclusions: (1) Prefetching in main memory is inexpensive and effective for balancing and can supplement or substitute prefetching in disk cache. (2) Disk-level prefetching provides about the same performance as main memory prefetching if request queues are managed in the disk controllers rather than in the host. Checking the disk cache before queuing requests provides not only better request response time but also drastically improves balancing. A single cache performs better than a segmented cache for this method.

Keywords: parallel I/O, prefetching, disk cache, disk array, pario-bib

Comment: An interesting paper about disk-controller cache management in database workloads. Actually, the workload consists of sequential scans of partitioned files, which could arise in many settings. The declustering pattern (partitioning) is a little unusual for most scientific parallel I/O veterans, who are used to striping. And the cache-management algorithms seem a bit strange, particularly the fact that the cache appears to be used only for explicit prefetch requests. It turns out to be best to put the prefetching and disk queueing in the same place, either on the controller or in main memory, to avoid the load imbalance that arises from randomness in the workload and that is otherwise accentuated into a big bottleneck and a convoy effect.

soltis:gfs:
Steven R. Soltis, Thomas M. Ruwart, and Matthew T. O'Keefe. The Global File System. In Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies, pages 319-342, College Park, MD, September 1996. IEEE Computer Society Press.
See also later version soltis:bgfs.

Keywords: distributed file system, data storage, mass storage, network-attached disks, disk striping, parallel I/O, pario-bib

Comment: see also preslan:gfs

solworth:mirror:
John A. Solworth and Cyril U. Orji. Distorted mirrors. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 10-17, December 1991.
See also later version solworth:mirror2.

Keywords: disk mirroring, parallel I/O, pario-bib

Comment: Write one disk (the master) in the usual way, and write the slave disk at the closest free block. Actually, they propose to logically partition the two disks so that each disk has a master partition and a slave partition. Up to 80% improvement in small-write performance, while retaining good sequential read performance.
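
A toy model of the write policy: the master copy goes to the block's home address, while the slave copy goes to the free block nearest the slave's current head position. The real system also handles reads, recovery, and full partitions, which this sketch omits.

#define NBLOCKS 10000

static int slave_free[NBLOCKS];        /* 1 = block is free */
static int remap[NBLOCKS];             /* master block -> slave block */
static int slave_head;                 /* current head position (block #) */

/* Find the free slave block closest to the head, scanning outward. */
static int slave_alloc_nearest(void)
{
    for (int d = 0; d < NBLOCKS; d++) {
        int lo = slave_head - d, hi = slave_head + d;
        if (lo >= 0      && slave_free[lo]) { slave_free[lo] = 0; return lo; }
        if (hi < NBLOCKS && slave_free[hi]) { slave_free[hi] = 0; return hi; }
    }
    return -1;                          /* slave partition full */
}

static void distorted_write(int block /*, data */)
{
    /* Master: write in place at 'block' (I/O omitted). */
    int s = slave_alloc_nearest();      /* slave: nearest free block */
    remap[block] = s;                   /* remember where the copy went */
    slave_head = s;
}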

solworth:mirror2:
John A. Solworth and Cyril U. Orji. Distorted mirrors. Journal of Distributed and Parallel Databases, 1(1):81-102, January 1993.
See also earlier version solworth:mirror.

Keywords: disk mirroring, parallel I/O, pario-bib

Comment: See solworth:mirror.

spencer:pipeline:
M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman, and J. Saltz. Executing multiple pipelined data analysis operations in the grid. In Proceedings of SC2002: High Performance Networking and Computing, Baltimore, Maryland, November 2002.

Abstract: Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.

Keywords: DataCutter, pipeline, dataflow, pario-bib

srinilta:strategies:
Chutimet Srinilta, Divyesh Jadav, and Alok Choudhary. Design and evaluation of data storage and retrieval strategies in a distributed memory continuous media server. In Proceedings of the Eleventh International Parallel Processing Symposium, pages 360-367, April 1997.

Abstract: High performance servers and high-speed networks will form the backbone of the infrastructure required for distributed multimedia information systems. Given that the goal of such a server is to support hundreds of interactive data streams simultaneously, various tradeoffs are possible with respect to the storage of data on secondary memory, and its retrieval therefrom. In this paper we identify and evaluate these tradeoffs. We evaluate the effect of varying the stripe factor and also the performance of batched retrieval of disk-resident data. We develop a methodology to predict the stream capacity of such a server. The evaluation is done for both uniform and skewed access patterns. Experimental results on the Intel Paragon computer are presented.

Keywords: threads, parallel I/O, pario-bib

stabile:disks:
James Joseph Stabile. Disk scheduling algorithms for a multiple disk system. Master's thesis, UC Davis, 1988.

Keywords: parallel I/O, parallel file system, disk mirroring, disk scheduling, pario-bib

Comment: Describes a simulation based on a model of the disk access pattern. Multiple-disk system, much like in matloff:multidisk. Files are stored in two copies, each on a separate disk, but there are more than two disks, so this differs from mirroring. He compares several disk-scheduling algorithms; a variant of SCAN seems to be the best.
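
For reference, the SCAN (elevator) discipline that comes out best is easy to sketch: serve pending requests in increasing cylinder order from the head position, then sweep back for the rest. Illustrative only, not the thesis's simulator.

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Reorder 'req' (n pending cylinder numbers) into SCAN service order,
 * given the head position; the sweep first moves toward higher
 * cylinders, then serves the remainder on the way back down. */
static void scan_order(int *req, int n, int head, int *out)
{
    int k = 0;
    qsort(req, n, sizeof(int), cmp_int);
    for (int i = 0; i < n; i++)         /* ascending pass: >= head */
        if (req[i] >= head) out[k++] = req[i];
    for (int i = n - 1; i >= 0; i--)    /* descending pass: < head */
        if (req[i] < head) out[k++] = req[i];
}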

steenkiste:net:
Peter Steenkiste. A high-speed network interface for distributed-memory systems: Architecture and applications. ACM Transactions on Computer Systems, 15(1):75-109, February 1997.

Keywords: parallel computer architecture, interconnection network, network interface, distributed memory, systolic array, input/output, parallel I/O, pario-bib

Comment: See also steenkiste:interface, kung:network, hemy:gigabit, bornstein:reshuffle, and gross:io.

stockinger:dictionary:
Heinz Stockinger. Dictionary on parallel input/output. Master's thesis, Department of Data Engineering, University of Vienna, February 1998.

Keywords: dictionary, survey, parallel I/O, pario-bib

Comment: A tremendous resource.

stodolsky:blogging:
Daniel Stodolsky, Garth Gibson, and Mark Holland. Parity logging: Overcoming the small write problem in redundant disk arrays. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 5, pages 67-80. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version stodolsky:logging.

Keywords: parallel I/O, RAID, redundancy, reliability, disk array, pario-bib

Comment: Part of jin:io-book; reformatted version of stodolsky:logging.

stodolsky:jlogging:
Daniel Stodolsky, Mark Holland, William V. Courtright II, and Garth A. Gibson. Parity-logging disk arrays. ACM Transactions on Computer Systems, 12(3):206-235, August 1994.
See also earlier version stodolsky:logging.

Keywords: parallel I/O, RAID, redundancy, reliability, pario-bib

Comment: See stodolsky:logging. An in-between version is CMU-CS-94-170, stodolsky:logging-tr.

stodolsky:logging:
Daniel Stodolsky, Garth Gibson, and Mark Holland. Parity logging: Overcoming the small write problem in redundant disk arrays. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 64-75, 1993.
See also earlier version stodolsky:logging-tr.
See also later version stodolsky:jlogging.

Abstract: Parity encoded redundant disk arrays provide highly reliable, cost effective secondary storage with high performance for read accesses and large write accesses. Their performance on small writes, however, is much worse than mirrored disks - the traditional, highly reliable, but expensive organization for secondary storage. Unfortunately, small writes are a substantial portion of the I/O workload of many important, demanding applications such as on-line transaction processing. This paper presents parity logging, a novel solution to the small write problem for redundant disk arrays. Parity logging applies journalling techniques to substantially reduce the cost of small writes. We provide a detailed analysis of parity logging and competing schemes - mirroring, floating storage, and RAID level 5 - and verify these models by simulation. Parity logging provides performance competitive with mirroring, the best of the alternative single failure tolerating disk array organizations. However, its overhead cost is close to the minimum offered by RAID level 5. Finally, parity logging can exploit data caching much more effectively than all three alternative approaches.

Keywords: parallel I/O, RAID, redundancy, reliability, disk array, pario-bib

Comment: Cite stodolsky:jlogging. An earlier version is CMU-CS-93-200. Uses parity logging to improve small writes: log all parity updates and, when the log fills, apply them to the parity disk. Actually, both the parity and the log are distributed across all disks. Performance is comparable to, or exceeds, mirroring. Also handles double failures.
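
The mechanism itself fits in a few lines: each small write appends a parity-update record (old data XOR new data) to a log, and a full log is applied to the parity region in one sequential pass. A toy in-memory model, not the paper's implementation.

#define BSZ   4096                      /* block size in bytes */
#define LOGSZ 128                       /* parity-update records per log */
#define NBLK  1024                      /* parity blocks (toy scale) */

struct upd { long block; unsigned char delta[BSZ]; };

static struct upd parity_log[LOGSZ];
static int log_len;
static unsigned char parity[NBLK][BSZ];

static void apply_log(void)             /* one large sequential pass */
{
    for (int i = 0; i < log_len; i++)
        for (int j = 0; j < BSZ; j++)
            parity[parity_log[i].block][j] ^= parity_log[i].delta[j];
    log_len = 0;
}

static void small_write(long block, const unsigned char *olddata,
                        const unsigned char *newdata)
{
    struct upd *u = &parity_log[log_len++];
    u->block = block;
    for (int j = 0; j < BSZ; j++)       /* parity update = old XOR new */
        u->delta[j] = olddata[j] ^ newdata[j];
    if (log_len == LOGSZ)
        apply_log();
    /* The data block itself is written in place (omitted). */
}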

stodolsky:logging-tr:
Daniel Stodolsky, Mark Holland, William V. Courtright II, and Garth A. Gibson. A redundant disk array architecture for efficient small writes. Technical Report CMU-CS-94-170, Carnegie Mellon University, July 1994. Revised from CMU-CS-93-200.
See also later version stodolsky:logging.

Abstract: Parity encoded redundant disk arrays provide highly reliable, cost effective secondary storage with high performance for reads and large writes. Their performance on small writes, however, is much worse than mirrored disks - the traditional, highly reliable, but expensive organization for secondary storage. Unfortunately, small writes are a substantial portion of the I/O workload of many important, demanding applications such as on-line transaction processing. This paper presents parity logging, a novel solution to the small write problem for redundant disk arrays. Parity logging applies journalling techniques to substantially reduce the cost of small writes. We provide detailed models of parity logging and competing schemes - mirroring, floating storage, and RAID level 5 - and verify these models by simulation. Parity logging provides performance competitive with mirroring, but with capacity overhead close to the minimum offered by RAID level 5. Finally, parity logging can exploit data caching more effectively than all three alternative approaches.

Keywords: parallel I/O, disk array, RAID, redundancy, reliability, pario-bib

stone:query:
Harold S. Stone. Parallel querying of large databases: A case study. IEEE Computer, 20(10):11-21, October 1987.

Keywords: parallel I/O, database, SIMD, connection machine, pario-bib

Comment: See also IEEE Computer, Jan 1988, p. 8 and 10. Examines a database query that is parallelized for the Connection Machine. He shows that in many cases, a smarter serial algorithm that reads only a portion of the database (through an index) will be faster than 64K processors reading the whole database. Uses a simple model for the machines to show this. Reemphasizes the point of Boral and DeWitt that I/O is the bottleneck of a database machine, and that parallelizing the processing will not necessarily help a great deal.

stonebraker:bradd:
Michael Stonebraker and Gerhard A. Schloss. Distributed RAID - a new multiple copy algorithm. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 6, pages 81-89. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version stonebraker:radd.

Keywords: disk striping, reliability, pario-bib

Comment: Part of jin:io-book; reformatted version of stonebraker:radd.

stonebraker:radd:
Michael Stonebraker and Gerhard A. Schloss. Distributed RAID - A new multiple copy algorithm. In Proceedings of 6th International Data Engineering Conference, pages 430-437, 1990.
See also later version stonebraker:bradd.

Keywords: disk striping, reliability, pario-bib

Comment: This is about ``RADD'', a distributed form of RAID. Meant for cases where the disks are physically distributed around several sites, and no one controller controls them all. Much lower space overhead than any mirroring technique, with comparable normal-mode performance at the expense of failure-mode performance.

stonebraker:xprs:
Michael Stonebraker, Randy Katz, David Patterson, and John Ousterhout. The design of XPRS. Technical Report UCB/ERL M88/19, UC Berkeley, March 1988.

Keywords: parallel I/O, disk array, RAID, Sprite, disk architecture, database, pario-bib

Comment: Designing a DBMS for Sprite and RAID. High availability, high performance. Shared-memory multiprocessor. Allocates extents to files that are interleaved over a variable number of disks, and over a contiguous set of tracks on those disks.

subramaniam:msthesis:
Mahesh Subramaniam. Efficient implementation of server-directed I/O. Master's thesis, Dept. of Computer Science, University of Illinois, June 1996.

Abstract: Parallel computers are a cost effective approach to providing significant computational resources to a broad range of scientific and engineering applications. Due to the relatively lower performance of the I/O subsystems on these machines and due to the significant I/O requirements of these applications, the I/O performance can become a major bottleneck. Optimizing the I/O phase of these applications poses a significant challenge. A large number of these scientific and engineering applications perform simple operations on multidimensional arrays and providing an easy and efficient mechanism for implementing these operations is important. The Panda array I/O library provides simple high level interfaces to specify collective I/O operations on multidimensional arrays in a distributed memory single-program multiple-data (SPMD) environment. The high level information provided by the user through these interfaces allows the Panda array I/O library to produce an efficient implementation of the collective I/O request. The use of these high level interfaces also increases the portability of the application.

This thesis presents an efficient and portable implementation of the Panda array I/O library. In this implementation, standard software components are used to build the I/O library to aid its portability. The implementation also provides a simple, flexible framework for the implementation and integration of the various collective I/O strategies. The server directed I/O and the reduced messages server directed I/O algorithms are implemented in the Panda array I/O library. This implementation supports the sharing of the I/O servers between multiple applications by extending the collective I/O strategies. Also, the implementation supports the use of part time I/O nodes where certain designated compute nodes act as the I/O servers during the I/O phase of the application. The performance of this implementation of the Panda array I/O library is measured on the IBM SP2 and the performance results show that for read and write operations, the collective I/O strategies used by the Panda array I/O library achieve throughputs close to the maximum throughputs provided by the underlying file system on each I/O node of the IBM SP2.

Keywords: parallel I/O, multiprocessor file system, pario-bib

sun:dynamic:
Weitao T. Sun, Jiwu W. Shu, and Weimin M. Zheng. Dynamic file allocation in storage area networks with neural network prediction. Lecture Notes in Computer Science, 3174:719-724, June 2004.

Abstract: Disk arrays are widely used in Storage Area Networks (SANs) to achieve mass storage capacity and high level I/O parallelism. Data partitioning and distribution among the disks is a promising approach to minimize the file access time and balance the I/O workload. But disk I/O parallelism by itself does not guarantee the optimal performance of an application. The disk access rates fluctuate with time because of access pattern variations, which leads to a workload imbalance. The user access pattern prediction is of great importance to dynamic data reorganization between hot and cool disks. Data migration occurs according to current and future disk allocation states and access frequencies. The objective of this paper is to develop a neural network based disk allocation trend prediction method and optimize the disks' file capacity to their balanced level. A Levenberg-Marquardt neural network was adopted to predict the disk access frequencies from the I/O trace history. Data reorganization on disk arrays was optimized to provide a good workload balance. The simulation results showed that the proposed method performs well.

Keywords: SAN, dynamic data reorganization, neural network, access pattern prediction, pario-bib

taber:metadisk:
David Taber. MetaDisk driver technical description. SunFlash electronic mailing list 22(9), October 1990.

Keywords: disk mirroring, parallel I/O, pario-bib

Comment: MetaDisk is an addition to the Sun SPARCstation server kernel. It allows disk mirroring between any two local disk partitions, or concatenation of several disk partitions into one larger partition, spanning up to 4 partitions simultaneously. It appears not to do striping; it just allows bigger partitions and (by chance) some parallel I/O for large files.
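
The concatenation mapping is simple enough to sketch: walk the partition sizes until the logical offset falls inside one. Hypothetical code, not the actual driver.

struct target { int part; long offset; };

/* Map a logical byte offset onto (partition, offset within partition)
 * for partitions glued end to end. */
static struct target concat_map(long logical, const long *size, int nparts)
{
    struct target t = { -1, 0 };
    for (int p = 0; p < nparts; p++) {
        if (logical < size[p]) { t.part = p; t.offset = logical; break; }
        logical -= size[p];
    }
    return t;                           /* part == -1: offset past the end */
}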

takahashi:performance:
Naoua Takahashi and Yasuo Kurosu. Performance improvement of disk array subsystems having shared cache and control memories. Transactions of the Institute of Electronics, Information and Communication Engineers D-I, J86-D-I(6):375-388, June 2003.

Abstract: Disk array subsystems have serious demands for higher speed and greater number of channels along with the trends in improving operational efficiency of information system by integrating its storage subsystems. Conventional disk array subsystem employs a bus-structured connection between its microprocessors and shared cache and control memories. In general, a network-structured connection can be faster as compared with a bus-structured one although a switch causes higher latency. In this paper we propose a hybrid star-net connection consisting of a hierarchically switched star fan-out for cache memory and a direct star fan-out for control memory, where cache is used as a temporary store of host data, and control memory stores various control data including cache control tables. The latter requires more speed than the former. Based on the proposed connection, we developed a disk array subsystem with host interface having 32 channels, and evaluated its performance. We could attain sequential performance of 920MB/s and transaction performance of 160KIO/s. In comparison to the conventional disk array subsystem, the former is 5 times, and the latter is 2.5 times better.

Keywords: disk array, star network topology, shared cache, pario-bib

talia:data-intensive:
Domenico Talia and Pradip K. Srimani. Parallel data-intensive algorithms and applications. Parallel Computing, 28(5):669-671, May 2002.

Keywords: parallel application, parallel I/O, pario-bib

Comment: guest editorial, no abstract

tan:pizzas:
Michael Tan, Nick Roussopoulos, and Steve Kelley. The Tower of Pizzas. Technical Report UMIACS-TR-95-52, University of Maryland Institute for Advanced Computer Studies (UMIACS), April 1995.

Abstract: CPU speeds are increasing at a much faster rate than secondary storage device speeds. Many important applications face an I/O bottleneck. We demonstrate that this bottleneck can be alleviated through 1) scalable striping of data and 2) caching/prefetching techniques. This paper describes the design and performance of the Tower of Pizzas (TOPs), a portable software system providing parallel I/O and buffering services.

Keywords: parallel I/O, pario-bib

Comment: Same as CS-TR-3462 from Department of Computer Science. Basically, a parallel file system for a workstation cluster using the usual parallel file-system ideas. They do support client-side caching, using a client-side server process which shares memory with the client. Otherwise nothing really new.

taylor:magic:
Herb Taylor, Danny Chin, and Stan Knight. The Magic video-on-demand server and real-time simulation system. IEEE Parallel and Distributed Technology, 3(2):40-51, Summer 1995.

Keywords: parallel I/O, multimedia, video on demand, pario-bib

Comment: They describe a video server system being developed at the Sarnoff Real Time Corporation. This paper describes their simulated system. It is intended not just as a video-on-demand system but also supports capture and processing as well as playback. So they have a complex system of interconnected SIMD boards, each with a high-speed link to various devices, including a collection of disk drives. Data is striped across disks. They integrate playback scheduling and the disk striping in an interesting way.

tennenhouse:debug:
Marsha Tennenhouse and Dror Zernik. Visual debugging of parallel file system programs. Technical report, IBM, March 1995.

Keywords: debugging, visualization, parallel file system, parallel I/O, pario-bib

tewari:bhigh:
Renu Tewari, Daniel M. Dias, Rajat Mukherjee, and Harrick M. Vin. High availability in clustered multimedia servers. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 38, pages 555-565. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version tewari:high.

Keywords: cluster, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of tewari:high.

tewari:high:
Renu Tewari, Daniel M. Dias, Rajat Mukherjee, and Harrick M. Vin. High availability in clustered multimedia servers. In Proceedings of the Twelfth International Conference on Data Engineering, pages 645-654, 1996.
See also later version tewari:bhigh.

Abstract: Clustered multimedia servers, consisting of interconnected nodes and disks, have been proposed for large-scale servers that are capable of supporting multiple concurrent streams which access the video objects stored in the server. As the number of disks and nodes in the cluster increases, so does the probability of a failure. With data striped across all disks in a cluster, the failure of a single disk or node results in the disruption of many or all streams in the system. Guaranteeing high availability in such a cluster becomes a primary requirement to ensure continuous service. In this paper, we study mirroring and software RAID schemes with different placement strategies that guarantee high availability in the event of disk and node failures while satisfying the real-time requirements of the streams. We examine various declustering techniques for spreading the redundant information across disks and nodes and show that random declustering has good real-time performance. Finally, we compare the overall cost per stream for different system configurations. We derive the parameter space where mirroring and software RAID apply, and determine optimal parity group sizes.

Keywords: cluster, parallel I/O, pario-bib

thakur:abstract:
Rajeev Thakur, William Gropp, and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 180-187, October 1996.
See also earlier version thakur:abstract-tr.

Abstract: In this paper, we propose a strategy for implementing parallel-I/O interfaces portably and efficiently. We have defined an abstract-device interface for parallel I/O, called ADIO. Any parallel-I/O API can be implemented on multiple file systems by implementing the API portably on top of ADIO, and implementing only ADIO on different file systems. This approach simplifies the task of implementing an API and yet exploits the specific high-performance features of individual file systems. We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our performance studies indicate that the overhead of using ADIO as an implementation strategy is very low.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

thakur:abstract-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. Technical Report MCS-P592-0596, Argonne National Laboratory, Mathematics and Computer Science Division, May 1996.
See also later version thakur:abstract.

Keywords: multiprocessor file system interface, parallel I/O, pario-bib

Comment: They propose an intermediate interface that can serve as an implementation base for all parallel file-system APIs, and which can itself be implemented on top of all parallel file systems. This ``universal'' interface allows all applications to run on all file systems with no porting, and lets people experiment with different APIs.
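
The layering can be pictured as a dispatch table of I/O primitives, one table per file system, with every higher-level API written only against the table. A hypothetical sketch in that spirit; the real ADIO interface is richer.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct adi_ops {                        /* the abstract-device interface */
    int     (*open) (const char *path, int flags);
    ssize_t (*read) (int fd, void *buf, size_t n, off_t off);
    ssize_t (*write)(int fd, const void *buf, size_t n, off_t off);
    int     (*close)(int fd);
};

/* A portable fallback built on plain Unix primitives. */
static int     ufs_open (const char *p, int f)                     { return open(p, f, 0666); }
static ssize_t ufs_read (int fd, void *b, size_t n, off_t o)       { return pread(fd, b, n, o); }
static ssize_t ufs_write(int fd, const void *b, size_t n, off_t o) { return pwrite(fd, b, n, o); }
static int     ufs_close(int fd)                                   { return close(fd); }

static const struct adi_ops ufs_ops = { ufs_open, ufs_read, ufs_write, ufs_close };
/* A PFS or PIOFS port would supply its own adi_ops table, and an API
 * such as MPI-IO would be implemented once, calling only through it. */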

thakur:applications:
Rajeev Thakur, Ewing Lusk, and William Gropp. I/O in parallel applications: The weakest link. The International Journal of High Performance Computing Applications, 12(4):389-395, Winter 1998. In a Special Issue on I/O in Parallel Applications.

Abstract: Parallel computers are increasingly being used to run large-scale applications that also have huge I/O requirements. However, many applications obtain poor I/O performance on modern parallel machines. This special issue of IJSA contains papers that describe the I/O requirements and the techniques used to perform I/O in real parallel applications. We first explain how the I/O application program interface (API) plays a critical role in enabling such applications to achieve high I/O performance. We describe how the commonly used Unix I/O interface is inappropriate for parallel I/O and how an explicitly parallel API with support for collective I/O can help the underlying I/O hardware and software perform I/O efficiently. We then describe MPI-IO, a recently defined, standard, portable API specifically designed for high-performance parallel I/O. We conclude with an overview of the papers in this special issue.

Keywords: parallel I/O application, pario-bib

thakur:astrophysics:
Rajeev Thakur, Ewing Lusk, and William Gropp. I/O characterization of a portable astrophysics application on the IBM SP and Intel Paragon. Technical Report MCS-P534-0895, Argonne National Laboratory, August 1995. Revised October 1995.

Abstract: Many large-scale applications on parallel machines are bottlenecked by the I/O performance rather than the CPU or communication performance of the system. To improve the I/O performance, it is first necessary for system designers to understand the I/O requirements of various applications. This paper presents the results of a study of the I/O characteristics and performance of a real, I/O-intensive, portable, parallel application in astrophysics, on two different parallel machines-the IBM SP and the Intel Paragon. We instrumented the source code to record all I/O activity, and analyzed the resulting trace files. Our results show that, for this application, the I/O consists of fairly large writes, and writing data to files is faster on the Paragon, whereas opening and closing files are faster on the SP. We also discuss how the I/O performance of this application could be improved; particularly, we believe that this application would benefit from using collective I/O.

Keywords: file access pattern, workload characterization, parallel I/O, pario-bib

Comment: Adds another data point to the collection of parallel scientific applications whose I/O has been characterized, a collection started in earnest by crandall:iochar. It's a pretty straightforward application; it just writes its matrices every few timesteps. The application writes whole matrices; the OS sees request sizes that are more a factor of the Chameleon library than of the application. Most of the I/O itself is not implemented in parallel, because they used UniTree on the SP, and because the Chameleon library sequentializes this kind of I/O through one node. Other numbers from the paper don't add much insight into the workload. Revised slightly in October 1995; the abstract represents that revision.

thakur:evaluation:
Rajeev Thakur, William Gropp, and Ewing Lusk. An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application. In Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC), volume 1127 of Lecture Notes in Computer Science, pages 24-35. Springer-Verlag, September 1996.
See also earlier version thakur:evaluation-tr.

Abstract: We present the results of an experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a real three-dimensional parallel application code. This application, developed by scientists at the University of Chicago, simulates the gravitational collapse of self-gravitating gaseous clouds. It performs parallel I/O by using library routines that we developed and optimized separately for the SP and Paragon. The I/O routines perform two-phase I/O and use the parallel file systems PIOFS on the SP and PFS on the Paragon. We studied the I/O performance for two different sizes of the application. In the small case, we found that I/O was much faster on the SP. In the large case, open, close, and read operations were only slightly faster, and seeks were significantly faster, on the SP; whereas, writes were slightly faster on the Paragon. The communication required within our I/O routines was faster on the Paragon in both cases. The highest read bandwidth obtained was 48 Mbytes/sec., and the highest write bandwidth obtained was 31.6 Mbytes/sec., both on the SP.

Keywords: parallel I/O, multiprocessor file system, workload characterization, pario-bib

thakur:evaluation-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application. Technical Report MCS-P569-0296, Argonne National Laboratory, February 1996.
See also later version thakur:evaluation.

Abstract: This paper presents the results of an experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon. For the evaluation, we used a full, three-dimensional application code that is in production use for studying the nonlinear evolution of Jeans instability in self-gravitating gaseous clouds. The application performs I/O by using library routines that we developed and optimized separately for parallel I/O on the SP and Paragon. The I/O routines perform two-phase I/O and use the PIOFS file system on the SP and PFS on the Paragon. We studied the I/O performance for two different sizes of the application. We found that for the small case, I/O was faster on the SP, whereas for the large case, I/O took almost the same time on both systems. Communication required for I/O was faster on the Paragon in both cases. The highest read bandwidth obtained was 48 Mbytes/sec. and the highest write bandwidth obtained was 31.6 Mbytes/sec., both on the SP.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: This version no longer on the web.

thakur:ext2phase:
Rajeev Thakur and Alok Choudhary. Accessing sections of out-of-core arrays using an extended two-phase method. Technical Report SCCS-685, NPAC, Syracuse University, January 1995.
See also later version thakur:ext2phase2.

Abstract: In out-of-core computations, data needs to be moved back and forth between main memory and disks during program execution. In this paper, we propose a technique called the Extended Two-Phase Method, for accessing sections of out-of-core arrays efficiently. This is an extension and generalization of the Two-Phase Method for reading in-core arrays from files, which was previously proposed in [Rosario93,Bordawekar93]. The Extended Two-Phase Method uses collective I/O in which all processors cooperate to perform I/O in an efficient manner by combining several I/O requests into fewer larger requests, eliminating multiple disk accesses for the same data and reducing contention for disks. We describe the algorithms for reading as well as writing array sections. Performance results on the Intel Touchstone Delta for many different access patterns are presented and analyzed. It is observed that the Extended Two-Phase Method gives consistently good performance over a wide range of access patterns.

Keywords: parallel I/O, pario-bib

Comment: Revised as thakur:ext2phase2 and thakur:jext2phase.

thakur:ext2phase2:
Rajeev Thakur and Alok Choudhary. An extended two-phase method for accessing sections of out-of-core arrays. Technical Report CACR-103, Scalable I/O Initiative, Center for Advanced Computing Research, Caltech, June 1995. Revised November 1995.
See also earlier version thakur:ext2phase.
See also later version thakur:jext2phase.

Abstract: A number of applications on parallel computers deal with very large data sets which cannot fit in main memory. In such cases, data must be stored in files on disks and fetched into main memory during program execution. In programs with large out-of-core arrays stored in files, it is necessary to read/write smaller sections of the arrays from/to files. This paper describes a method, called the extended two-phase method, for accessing sections of out-of-core arrays in an efficient manner. This method uses collective I/O in which processors cooperate to combine several I/O requests into fewer larger granularity requests, reorder requests so that the file is accessed in proper sequence, and eliminate simultaneous I/O requests for the same data. The I/O workload is divided among processors dynamically, depending on the access requests. We present performance results for two real, out-of-core, parallel applications - matrix multiplication and a Laplace's equation solver - and several synthetic access patterns. The results indicate that the extended two-phase method provides a significant performance improvement over a direct method for I/O.

Keywords: parallel I/O, pario-bib

Comment: Revised version of thakur:ext2phase. The tech report was itself revised in November 1995; the abstract represents that revision.

thakur:jext2phase:
Rajeev Thakur and Alok Choudhary. An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays. Scientific Programming, 5(4):301-317, Winter 1996.
See also earlier version thakur:ext2phase2.

Abstract: A number of applications on parallel computers deal with very large data sets that cannot fit in main memory. In such applications, data must be stored in files on disks and fetched into memory during program execution. Parallel programs with large out-of-core arrays stored in files must read/write smaller sections of the arrays from/to files. In this article, we describe a method for accessing sections of out-of-core arrays efficiently. Our method, the extended two-phase method, uses collective I/O: Processors cooperate to combine several I/O requests into fewer larger granularity requests, reorder requests so that the file is accessed in proper sequence, and eliminate simultaneous I/O requests for the same data. In addition, the I/O workload is divided among processors dynamically, depending on the access requests. We present performance results obtained from two real out-of-core parallel applications-matrix multiplication and a Laplace's equation solver-and several synthetic access patterns, all on the Intel Touchstone Delta. These results indicate that the extended two-phase method significantly outperformed a direct (noncollective) method for accessing out-of-core array sections.

Keywords: parallel I/O, pario-bib

thakur:jpassion:
Rajeev Thakur, Alok Choudhary, Rajesh Bordawekar, Sachin More, and Sivaramakrishna Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996.

Abstract: Parallel computers with peak performance of more than 100 Gflops/second are already available to solve a variety of problems in a range of disciplines. However, the input/output performance of these machines is a poor reflection of their true computational power.

To improve the I/O performance of parallel programs with distributed multidimensional arrays, we have developed a software library called Passion (Parallel, Scalable Software for Input/Output). Passion's routines are designed to read or write either entire distributed arrays or sections of such arrays. Passion also frees the programmer from many of the tedious tasks associated with performing I/O in parallel programs and has a high-level interface that makes it easy to specify the required I/O.

We have implemented Passion on Intel's Paragon, Touchstone Delta, and iPSC/860 systems, and on the IBM SP system. We have also made it publicly available through the World Wide Web (http://www.cat.syr.edu/passion.html). We are in the process of porting the library to other machines and extending its functionality.

Keywords: parallel I/O, pario-bib

Comment: See thakur:passion, choudhary:passion.

thakur:mpi:
Rajeev Thakur, William Gropp, and Ewing Lusk. A case for using MPI's derived datatypes to improve I/O performance. In Proceedings of SC98: High Performance Networking and Computing. ACM Press, November 1998.
See also earlier version thakur:mpi-tr.

Abstract: MPI-IO, the I/O part of the MPI-2 standard, is a promising new interface for parallel I/O. A key feature of MPI-IO is that it allows users to access several noncontiguous pieces of data from a file with a single I/O function call by defining file views with derived datatypes. We explain how critical this feature is for high performance, why users must create and use derived datatypes whenever possible, and how it enables implementations to perform optimizations. In particular, we describe two optimizations our MPI-IO implementation, ROMIO, performs: data sieving and collective I/O. We demonstrate the performance and portability of the approach with performance results on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Keywords: MPI, parallel I/O, pario-bib

thakur:mpi-io-implement:
Rajeev Thakur, William Gropp, and Ewing Lusk. On implementing MPI-IO portably and with high performance. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 23-32, May 1999.
See also earlier version thakur:mpi-io-implement-tr.

Abstract: We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO.

In addition to basic I/O functionality, we consider the issues of supporting other MPI-IO features, such as 64-bit file sizes, noncontiguous accesses, collective I/O, asynchronous I/O, consistency and atomicity semantics, user-supplied hints, shared file pointers, portable data representation, and file preallocation. We describe how we implemented each of these features on various machines and file systems. The machines we consider are the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI Origin2000, and networks of workstations; and the file systems we consider are HP HFS, IBM PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix file system (UFS).

We also present our thoughts on how a file system can be designed to better support MPI-IO. We provide a list of features desired from a file system that would help in implementing MPI-IO correctly and with high performance.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

thakur:mpi-io-implement-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. On implementing MPI-IO portably and with high performance. Technical Report ANL/MCS-P732-1098, Mathematics and Computer Science Division, Argonne National Laboratory, October 1998.
See also later version thakur:mpi-io-implement.

Abstract: We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO.

In addition to basic I/O functionality, we consider the issues of supporting other MPI-IO features, such as 64-bit file sizes, noncontiguous accesses, collective I/O, asynchronous I/O, consistency and atomicity semantics, user-supplied hints, shared file pointers, portable data representation, file preallocation, and some miscellaneous features. We describe how we implemented each of these features on various machines and file systems. The machines we consider are the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI Origin2000, and networks of workstations; and the file systems we consider are HP HFS, IBM PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix file system (UFS).

We also present our thoughts on how a file system can be designed to better support MPI-IO. We provide a list of features desired from a file system that would help in implementing MPI-IO correctly and with high performance.

Keywords: parallel I/O, multiprocessor file system interface, pario-bib

thakur:mpi-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. A case for using MPI's derived datatypes to improve I/O performance. Technical Report ANL/MCS-P717-0598, Mathematics and Computer Science Division, Argonne National Laboratory, May 1998.
See also later version thakur:mpi.

Abstract: MPI-IO, the I/O part of the MPI-2 standard, is a promising new interface for parallel I/O. A key feature of MPI-IO is that it allows users to access several noncontiguous pieces of data from a file with a single I/O function call by defining file views with derived datatypes. We explain how critical this feature is for high performance, why users must create and use derived datatypes whenever possible, and how it enables implementations to perform optimizations. In particular, we describe two optimizations our MPI-IO implementation, ROMIO, performs: data sieving and collective I/O. We present performance results on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Keywords: MPI, parallel I/O, pario-bib

thakur:noncontigous:
Rajeev Thakur, William Gropp, and Ewing Lusk. Optimizing noncontiguous accesses in MPI-IO. Parallel Computing, 28(1):83-105, January 2002.

Abstract: The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. We first provide a classification of the different ways of expressing an application's I/O needs in MPI-IO-we classify them into four levels, called level 0 through level 3. We demonstrate that, for applications with noncontiguous access patterns, the I/O performance improves dramatically if users write their applications to make level-3 requests (noncontiguous, collective) rather than level-0 requests (Unix style). We then describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how we have implemented these optimizations portably on multiple machines and file systems, controlled their memory requirements, and also achieved high performance. We demonstrate the performance and portability with performance results for three applications-an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)-on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Keywords: parallel I/O, MPI-IO, collective I/O, data sieving, pario-bib
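
A level-3 request in this classification looks like the following standard MPI-IO fragment: a subarray datatype describes this process's noncontiguous file region, a file view installs it, and a collective write delivers everything in one call. The block decomposition here is invented for illustration.

#include <mpi.h>

/* Each process writes its bn x bn block of a global n x n array of
 * doubles; (pi, pj) are the process's block coordinates. */
void write_block(double *local, int n, int bn, int pi, int pj)
{
    MPI_Datatype filetype;
    int sizes[2]    = { n, n };
    int subsizes[2] = { bn, bn };
    int starts[2]   = { pi * bn, pj * bn };
    MPI_File fh;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, bn * bn, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}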

thakur:out-of-core:
Rajeev Thakur, Rajesh Bordawekar, and Alok Choudhary. Compilation of out-of-core data parallel programs for distributed memory machines. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 54-72. Syracuse University, April 1994. Also appeared in Computer Architecture News 22(4).
See also later version thakur:out-of-core-book.

Keywords: parallel I/O, pario-bib

Comment: Earlier version available as NPAC/Syracuse tech report. They describe the design of an HPF compiler that can translate out-of-core programs into plain programs with explicit I/O. For the most part, they discuss many of the issues involved in manipulating the arrays, and some of the alternatives for run-time support. The out-of-core array is broken into pieces, one per processor. Each processor keeps its local array piece in a file on its own logical disk, and reads and writes pieces of that file as needed. Some of the tradeoffs appear to contrast the amount of I/O with the ability to optimize communication: they choose a method called ``out-of-core communication'' because it simplifies the analysis of communication patterns, although it requires more I/O. The compiler depends on run-time routines for support; the run-time routines hide a lot of the architectural details, simplifying the job of the compiler and making the resulting program more portable. There are some preliminary performance numbers.

thakur:out-of-core-book:
Rajeev Thakur and Alok Choudhary. Runtime support for out-of-core parallel programs. In Jain et al. [iopads-book], chapter 6, pages 147-165.
See also earlier version thakur:out-of-core.

Abstract: In parallel programs with large out-of-core arrays stored in files, it is necessary to read/write smaller sections of the arrays from/to files. We describe a runtime method for accessing sections of out-of-core arrays efficiently. This method, called the extended two-phase method, uses collective I/O in which processors cooperate to read/write out-of-core data in an efficient manner. The I/O workload is divided among processors dynamically, depending on the access requests. Performance results on the Intel Touchstone Delta show that the extended two-phase method performs considerably better than a direct method for different access patterns, array sizes, and number of processors. We have used the extended two-phase method in the PASSION runtime library for parallel I/O.

Keywords: parallel I/O, out-of-core, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.
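
To make the two phases concrete, here is a conceptual sketch of a collective read of column blocks from a row-major matrix: in phase one each process reads a large contiguous band of rows, and in phase two an all-to-all permutes the data into the wanted column distribution. Sizes are fixed (p divides n) to keep it short; the extended method additionally partitions the file domain dynamically, which this sketch does not show.

#include <mpi.h>
#include <stdlib.h>

void two_phase_read(MPI_File fh, int n,           /* matrix is n x n doubles */
                    double *mycols)               /* out: n x (n/p) block    */
{
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int rows = n / p, cols = n / p;               /* assume p divides n */

    /* Phase 1: one large contiguous read of my band of rows. */
    double *band = malloc((size_t)rows * n * sizeof(double));
    MPI_File_read_at(fh, (MPI_Offset)rank * rows * n * sizeof(double),
                     band, rows * n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* Pack the piece of my band destined for each process. */
    double *send = malloc((size_t)rows * n * sizeof(double));
    for (int d = 0; d < p; d++)
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                send[(d * rows + r) * cols + c] = band[r * n + d * cols + c];

    /* Phase 2: redistribute, leaving each process with its columns. */
    MPI_Alltoall(send, rows * cols, MPI_DOUBLE,
                 mycols, rows * cols, MPI_DOUBLE, MPI_COMM_WORLD);
    free(send);
    free(band);
}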

thakur:passion:
Rajeev Thakur, Rajesh Bordawekar, Alok Choudhary, Ravi Ponnusamy, and Tarvinder Singh. PASSION runtime library for parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, pages 119-128, October 1994.
See also later version thakur:jpassion.

Abstract: We are developing a compiler and runtime support system called PASSION: Parallel And Scalable Software for Input-Output. PASSION provides software support for I/O intensive out-of-core loosely synchronous problems. This paper gives an overview of the PASSION Runtime Library and describes two of the optimizations incorporated in it, namely Data Prefetching and Data Sieving. Performance improvements provided by these optimizations on the Intel Touchstone Delta are discussed, together with an out-of-core Median Filtering application.

Keywords: parallel I/O, pario-bib

Comment: See thakur:jpassion. They describe the PASSION library for parallel I/O, though the description is fairly high-level. The main things that this paper adds to earlier papers from this group are a discussion of Data Prefetching (which is really just an asynchronous I/O interface that their compiler uses for prefetching) and Data Sieving, which they use when the application needs to read an array section that is not contiguous in the file; for example, a submatrix of a 2-d matrix in a file stored row-major. Their solution is to read the complete set of rows (or columns, depending on file layout) in one huge read, into a memory buffer, and then extract the necessary data. Basically, this is another form of the two-phase strategy.
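
Data sieving as described here reduces to a few lines: read one contiguous region that covers all the strided pieces, then copy out only the useful bytes. A sketch, not PASSION's code.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read 'count' pieces of 'piece' bytes each, spaced 'stride' bytes
 * apart starting at file offset 'first', into 'out' -- using a single
 * large read instead of 'count' small ones. */
static int sieve_read(int fd, off_t first, size_t piece, off_t stride,
                      int count, char *out)
{
    size_t span = (size_t)stride * (count - 1) + piece;  /* covering extent */
    char *buf = malloc(span);
    if (buf == NULL)
        return -1;
    if (pread(fd, buf, span, first) != (ssize_t)span) {
        free(buf);
        return -1;
    }
    for (int i = 0; i < count; i++)      /* extract only the wanted bytes */
        memcpy(out + (size_t)i * piece, buf + (size_t)i * stride, piece);
    free(buf);
    return 0;
}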

thakur:romio:
Rajeev Thakur, William Gropp, and Ewing Lusk. Data sieving and collective I/O in ROMIO. In Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computation, pages 182-189. IEEE Computer Society Press, February 1999.
See also earlier version thakur:romio-tr.

Abstract: The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access.

We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications - an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC) - on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Keywords: parallel I/O, collective I/O, application programmer interface, pario-bib

Comment: They describe how ROMIO, their MPI-IO implementation, delivers high performance through the use of data sieving and collective I/O. The paper discusses several specific optimizations. They have results from five major parallel platforms. The paper confirms that the UNIX interface is terrible for many parallel access patterns, and that collective I/O is an important solution.
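
As a flavor of the interface these optimizations serve, here is a minimal sketch using standard MPI-IO calls (not ROMIO source; the global and local array dimensions are the caller's): one collective call describes each process's noncontiguous block of a 2-d array, handing the implementation the whole access pattern at once so it can apply data sieving or two-phase collective I/O.

  #include <mpi.h>

  /* Each process reads its block of a 2-d array of doubles stored in a
   * single file, using a subarray filetype and one collective call. */
  void read_block(const char *path, int gsizes[2], int lsizes[2],
                  int starts[2], double *local)
  {
      MPI_File fh;
      MPI_Datatype filetype;

      MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                               MPI_ORDER_C, MPI_DOUBLE, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY,
                    MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
      MPI_File_read_all(fh, local, lsizes[0] * lsizes[1], MPI_DOUBLE,
                        MPI_STATUS_IGNORE);
      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
  }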

thakur:romio-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. Data sieving and collective I/O in ROMIO. Technical Report ANL/MCS-P723-0898, Mathematics and Computer Science Division, Argonne National Laboratory, August 1998.
See also later version thakur:romio.

Abstract: The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications - an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC) - on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Keywords: parallel I/O, collective I/O, application programmer interface, pario-bib

thakur:romio-users:
Rajeev Thakur, Ewing Lusk, and William Gropp. Users guide for ROMIO: A high-performance, portable MPI-IO implementation. Technical Report ANL/MCS-TM-234, Mathematics and Computer Science Division, Argonne National Laboratory, October 1997.

Abstract: ROMIO is a high-performance, portable implementation of MPI-IO (the I/O chapter in MPI-2). This document describes how to install and use ROMIO version 1.0.0 on various machines.

Keywords: file system interface, parallel I/O, pario-bib

thakur:runtime:
R. Thakur, R. Bordawekar, and A. Choudhary. Compiler and Runtime Support for Out-of-Core HPF Programs. In Proceedings of the 8th ACM International Conference on Supercomputing, pages 382-391, Manchester, UK, July 1994. ACM Press.

Abstract: This paper describes the design of a compiler which can translate out-of-core programs written in a data parallel language like HPF. Such a compiler is required for compiling large scale scientific applications, such as the Grand Challenge applications, which deal with enormous quantities of data. We propose a framework by which a compiler together with appropriate runtime support can translate an out-of-core HPF program to a message passing node program with explicit parallel I/O. We describe the basic model of the compiler and the various transformations made by the compiler. We also discuss the runtime routines used by the compiler for I/O and communication. In order to minimize I/O, the runtime support system can reuse data already fetched into memory. The working of the compiler is illustrated using two out-of-core applications, namely a Laplace equation solver and LU Decomposition, together with performance results on the Intel Touchstone Delta.

Keywords: parallel I/O, pario-bib

Comment: They describe ways to make HPF handle out-of-core arrays. Basically, they add directives to say which arrays are out of core, and how much memory to devote to the in-core portion of the array. Then the compiler distributes the array across processors, as in HPF, to form local arrays. Each local array is broken into slabs, where each slab can fit in local memory. The local array is kept in a local array file, from which slabs are loaded and stored. Ghost nodes are also handled. They were careful to avoid double I/O when one slab is another slab's ghost node. They found it most convenient to do all the communication between iterations, then do all the computation for that iteration, where the iteration itself required a loop including both computation and I/O. This means that there may need to be I/O during the communication phase, to store ghost nodes coming in from other places. They do not mention use of asynchronous I/O for overlap. See also bordawekar:efficient.
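
A minimal sketch of that slab-at-a-time structure in C, with the communication phase elided and a trivial stand-in for the computation (the file name, slab count, and slab size are hypothetical):

  #include <stdio.h>
  #include <stdlib.h>

  #define NSLABS 8
  #define SLAB   (1 << 20)                 /* elements per in-core slab */

  int main(void)
  {
      double *slab = malloc(SLAB * sizeof *slab);
      FILE *f = fopen("local_array.dat", "r+b"); /* this node's local file */

      for (int iter = 0; iter < 10; iter++) {
          /* (a real code would exchange ghost cells here) */
          for (int s = 0; s < NSLABS; s++) {
              fseek(f, (long)s * SLAB * sizeof(double), SEEK_SET);
              fread(slab, sizeof(double), SLAB, f);   /* load one slab */
              for (long i = 0; i < SLAB; i++)
                  slab[i] *= 0.5;                     /* "compute" on it */
              fseek(f, (long)s * SLAB * sizeof(double), SEEK_SET);
              fwrite(slab, sizeof(double), SLAB, f);  /* store it back */
          }
      }
      fclose(f);
      free(slab);
      return 0;
  }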

thakur:thesis:
Rajeev Thakur. Runtime Support for In-Core and Out-of-Core Data-Parallel Programs. PhD thesis, Department of Electrical and Computer Engineering, Syracuse University, May 1995.

Abstract: Distributed memory parallel computers or distributed computer systems are widely recognized as the only cost-effective means of achieving teraflops performance in the near future. However, the fact remains that they are difficult to program and advances in software for these machines have not kept pace with advances in hardware. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be directly used in application programs for greater efficiency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory machines.

In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents efficient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. The performance of these algorithms on the CM-5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparing with experimental results.

A number of applications deal with very large data sets which cannot fit in main memory, and hence have to be stored in files on disks, resulting in out-of-core programs. This thesis also describes the design and implementation of efficient runtime support for out-of-core computations. Several optimizations for accessing out-of-core data are presented. An Extended Two-Phase Method is proposed for accessing sections of out-of-core arrays efficiently. This method uses collective I/O and the I/O workload is divided among processors dynamically, depending on the access requests. Performance results obtained using this runtime support for out-of-core programs on the Touchstone Delta are presented.

Keywords: parallel I/O, runtime library, pario-bib

think:cm-2:
Connection Machine model CM-2 technical summary. Technical Report HA87-4, Thinking Machines, April 1987.

Keywords: parallel I/O, connection machine, disk array, disk architecture, SIMD, pario-bib

Comment: I/O and Data Vault, pp. 27-30

think:cm5:
The Connection Machine CM-5 Technical Summary. Thinking Machines Corporation, October 1991.

Keywords: computer architecture, connection machine, MIMD, SIMD, parallel I/O, pario-bib

Comment: Some detail, but still skips over some key aspects (like communication topology). Neat communications support provides user-mode message-passing, broadcasting, and reductions, all built in. Lots of info here. File system calls allow data to be transferred in parallel directly from I/O node to processing node, bypassing the partition and I/O management nodes. Multiple I/O devices (even DataVaults) can be logically striped. See also best:cmmdio, loverso:sfs, think:cmmd, think:sda.

think:cm5io:
The CM-5 I/O system. Thinking Machines Corporation glossy, 1993.

Keywords: parallel I/O, disk array, striping, RAID, HIPPI, pario-bib

Comment: More detail about I/O nodes than think:sda, including info about disk storage nodes, HIPPI nodes, and tape nodes (ITS).

think:cmmd:
Thinking Machines Corporation. CMMD User's Guide, January 1992.

Keywords: MIMD, parallel programming, parallel I/O, message-passing, pario-bib

think:sda:
CM-5 scalable disk array. Thinking Machines Corporation glossy, November 1992.

Keywords: parallel I/O, disk array, striping, RAID, pario-bib

Comment: Disk storage nodes (processor, network interface, buffer, 4 SCSI controllers, 8 disks) attach individually to the CM-5 network. The software stripes across all nodes in the system. Thus, the collection of nodes is called a disk array. Multiple file systems across the array. Flexible redundancy. RAID 3 is used, i.e., bit-striped and a single parity disk. Remote access via NFS supported. Files stored in canonical order, with special hardware to help distribute data across processors. See best:cmmdio.

thomas:panda:
Joel T. Thomas. The Panda array I/O library on the Galley parallel file system. Technical Report PCS-TR96-288, Dept. of Computer Science, Dartmouth College, June 1996. Senior Honors Thesis.

Abstract: The Panda Array I/O library, created at the University of Illinois, Urbana-Champaign, was built especially to address the needs of high-performance scientific applications. I/O has been one of the most frustrating bottlenecks to high performance for quite some time, and the Panda project is an attempt to ameliorate this problem while still providing the user with a simple, high-level interface. The Galley File System, with its hierarchical structure of files and strided requests, is another attempt at addressing the performance problem. My project was to redesign the Panda Array library for use on the Galley file system. This project involved porting Panda's three main functions: a checkpoint function for writing a large array periodically for 'safekeeping,' a restart function that would allow a checkpointed file to be read back in, and finally a timestep function that would allow the user to write a group of large arrays several times in a sequence. Panda supports several different distributions in both the compute-node memories and I/O-node disks.

We have found that the Galley File System provides a good environment on which to build high-performance libraries, and that the mesh of Panda and Galley was a successful combination.

Keywords: multiprocessor file system, parallel I/O, pario-bib

Comment: See seamons:thesis.

thomasian:allocation:
Alexander Thomasian. Data allocation and scheduling in disks and disk arrays. Lecture Notes in Computer Science, 2965:357-384, April 2004.

Abstract: Magnetic disks, which together with disk arrays constitute a multibillion dollar industry, were developed in the 1950s. Disks were an advance over magnetic drums, which had a dedicated read/write head per track, since much higher amounts of data could be accessed in a cost effective manner due to the sharability of the movable read/write heads. DRAM memories, which are volatile, were projected to replace disks a decade ago (see Section 2.4 in [33]). This did not materialize due to the inherent volatility of DRAM, i.e., a power source is required to ensure that DRAM contents are not lost, but also due to recent dramatic increases in areal recording density and hence disk capacity, which is estimated at 60% compound annual growth rate - CAGR. This has resulted in a rapid decrease in cost per megabyte of disk capacity, so that it is lower than DRAM by a factor of 1000 to one.

Keywords: data allocation, scheduling, disk arrays, pario-bib

tierney:cache:
Brian L. Tierney, Jason Lee, Brian Crowley, Mason Holding, Jeremy Hylton, and Fred L. Drake, Jr. A network-aware distributed storage cache for data-intensive environments. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 185-193, Redondo Beach, CA, August 1999. IEEE Computer Society Press.

Abstract: Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data at multiple sites around the world. The technologies, the middleware services, and the architectures that are used to build useful high-speed, wide area distributed systems, constitute the field of data intensive computing. In this paper we will describe an architecture for data intensive applications where we use a high-speed distributed data cache as a common element for all of the sources and sinks of data. This cache-based approach provides standard interfaces to a large, application-oriented, distributed, on-line, transient storage system. We describe our implementation of this cache, how we have made it "network aware," and how we do dynamic load balancing based on the current network conditions. We also show large increases in application throughput by access to knowledge of the network conditions.

Keywords: distributed cache, distributed computing, grid, input/output, network-aware, parallel I/O, pario-bib

Comment: They discuss their implementation of a ``network aware'' data cache (Distributed Parallel Storage System) that adapts to changing network conditions. The system itself looks much like the Galley File System. The client library is multi-threaded, with a client thread for each DPSS server. A DPSS server is composed of a block request thread, a block writer thread, a shared disk cache, and a reader thread for each disk. Block requests move into the shared cache from the disks. A DPSS master directs client requests to an appropriate DPSS server. They use Java agents to monitor network performance and use data replication for load balancing. A minimum cost flow algorithm is run each time a client request arrives to determine the best place to retrieve the data block. They argue that since the algorithm is fast (< 1 ms), its overhead is not significant.

tmc:cmio:
Thinking Machines Corporation. Programming the CM I/O System, November 1990.

Keywords: parallel I/O, file system interface, multiprocessor file system, pario-bib

Comment: Have two types of files, parallel and serial, differing in the way data is laid out internally. Also have three modes for reading the file: synchronous, streaming (asynchronous), and buffered.

tobis:foam:
Michael Tobis, Chad Schafer, Ian Foster, Robert Jacob, and John Anderson. FOAM: Expanding the horizons of climate modeling. In Proceedings of SC97: High Performance Networking and Computing. IEEE Computer Society Press, November 1997.

Abstract: We report here on a project that expands the applicability of dynamic climate modeling to very long time scales. The Fast Ocean_Atmosphere Model (FOAM) is a coupled ocean-atmosphere model that incorporates physics of interest in understanding decade to century time scale variability. It addresses the high computational cost of this endeavor with a combination of improved ocean model formulation, low atmosphere resolution, and efficient coupling. It also uses message-passing parallel processing techniques, allowing for the use of cost-effective distributed memory platforms. The resulting model runs over 6000 times faster than real time with good fidelity and has yielded significant results.

Keywords: parallel I/O, scientific application, pario-bib

Comment: This paper is about the Fast Ocean-Atmosphere Model (FOAM), a climate model that uses ``a combination of new model formulation and parallel computing to expand the time horizon that may be addressed by explicit fluid dynamical representations of the climate system.'' Their model uses message passing on massively parallel distributed-memory computer systems. They are in the process of investigating using parallel I/O to further increase their efficiency.

toledo:solar:
Sivan Toledo and Fred G. Gustavson. The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 28-40, Philadelphia, May 1996. ACM Press.

Abstract: SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations on both shared-memory and distributed-memory machines, and its matrix input-output library supports both conventional I/O interfaces and parallel I/O interfaces. This paper discusses the overall design of SOLAR, its interfaces, and the design of several important subroutines. Experimental results show that SOLAR can factor on a single workstation an out-of-core positive-definite symmetric matrix at a rate exceeding 215 Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops. Less than 16% of the running time is spent on I/O in these computations. These results indicate that SOLAR's portability does not compromise its performance. We expect that the combination of portability, modularity, and the use of a high-level I/O interface will make the library an important platform for research on out-of-core algorithms and on parallel I/O.

Keywords: parallel I/O, out-of-core, linear algebra, pario-bib

Comment: Sounds great. Library package that supports LAPACK-like functionality on in-core and out-of-core matrices. Good performance. Good portability (IBM workstation, IBM SP-2, and OS/2 laptop). They separate the matrix algorithms from the underlying I/O routines in an interesting way (read and write submatrices), leaving just enough information to allow the I/O system to do some higher-level optimizations.

toledo:survey:
Sivan Toledo. A survey of out-of-core algorithms in numerical linear algebra. In Abello and Vitter [abello:dimacs], pages 161-180.

Keywords: out-of-core algorithm, survey, numerical analysis, linear algebra, pario-bib

Comment: One of the component papers of abello:dimacs; see also the other component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees. Not clear to what extent these papers are about *parallel* I/O.

tomkins:multi-process:
Andrew Tomkins, R. Hugo Patterson, and Garth Gibson. Informed multi-process prefetching and caching. In Proceedings of the 1997 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 100-114. ACM Press, June 1997.

Keywords: pario-bib

torrellas:PnetCDF:
Jianwei Li, Wei keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Rob Latham, Andrew Siegel, Brad Gallagher, and Michael Zingale. Parallel netCDF: A high-performance scientific I/O interface. In Proceedings of SC2003: High Performance Networking and Computing, Phoenix, AZ, November 2003. IEEE Computer Society Press.

Abstract: Dataset storage, exchange, and access play a critical role in scientific applications. For such purposes netCDF serves as a portable, efficient file format and programming interface, which is popular in numerous scientific application domains. However, the original interface does not provide an efficient mechanism for parallel data storage and access.

In this work, we present a new parallel interface for writing and reading netCDF datasets. This interface is derived with minimal changes from the serial netCDF interface but defines semantics for parallel access and is tailored for high performance. The underlying parallel I/O is achieved through MPI-IO, allowing for substantial performance gains through the use of collective I/O optimizations. We compare the implementation strategies and performance with HDF5. Our tests indicate programming convenience and significant I/O performance improvement with this parallel netCDF (PnetCDF) interface.

Keywords: parallel I/O interface, netCDF, MPI-IO, pario-bib

Comment: Published on the web only.

towsley:cpuio:
Donald F. Towsley. The effects of CPU:I/O overlap in computer system configurations. In Proceedings of the 5th Annual International Symposium on Computer Architecture, pages 238-241, April 1978.

Keywords: parallel processing, parallel I/O, pario-bib

Comment: Difficult to follow since it is missing its figures. ``Our most important result is that multiprocessor systems can benefit considerably more than single processor systems with the introduction of CPU:I/O overlap.'' They overlap I/O needed by some future CPU sequence with the current CPU operation. They claim it looks good for large numbers of processors. Their orientation seems to be for multiprocessors operating on independent tasks.

towsley:cpuio-parallel:
D. Towsley, K. M. Chandy, and J. C. Browne. Models for parallel processing within programs: Application to CPU:I/O and I/O:I/O overlap. Communications of the ACM, 21(10):821-831, October 1978.

Keywords: parallel processing, parallel I/O, pario-bib

Comment: Models CPU:I/O and I/O:I/O overlap within a program. ``Overlapping is helpful only when it allows a device to be utilized which would not be utilized without overlapping.'' In general the overlapping seems to help.

trabado:io:
Guillermo P. Trabado and E. L. Zapata. Support for massive data input/output on parallel computers. In Proceedings of the Fifth Workshop on Compilers for Parallel Computers, pages 347-356, June 1995.

Keywords: parallel I/O, sparse matrix, pario-bib

Comment: They discuss a library to support irregular data structures, really sparse matrices, on distributed-memory machines. Their library supports several in-memory and out-of-core data distributions, and routines to read and write matrices in those distributions. The paper is sketchy and poorly written. There is little material on I/O.

tran:adaptive:
Nancy Tran and Daniel A. Reed. ARIMA time series modeling and forecasting for adaptive I/O prefetching. In Proceedings of the 15th international conference on Supercomputing, pages 473-485, June 2001.

Abstract: Bursty application I/O patterns, together with transfer limited storage devices, combine to create a major I/O bottleneck on parallel systems. This paper explores the use of time series models to forecast application I/O request times, then prefetching I/O requests during computation intervals to hide I/O latency. Experimental results with I/O intensive scientific codes show performance improvements compared to standard UNIX prefetching strategies.

Keywords: pario-bib, access pattern, prefetching, modeling, time-series analysis
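
The ARIMA machinery itself is too heavy for a short sketch, but the decision it drives can be illustrated with a much simpler stand-in predictor (ours, not the paper's model): keep a smoothed estimate of the gap between I/O requests and trigger a prefetch only when the predicted idle interval is long enough to hide a fetch.

  /* Exponentially weighted moving average of inter-request times; the
   * smoothing weight and fetch cost are hypothetical. */
  #define ALPHA         0.3       /* smoothing weight */
  #define PREFETCH_COST 2.0       /* ms to fetch one block */

  static double predicted_gap = 0.0;

  /* Called on each request with the time since the previous request;
   * returns nonzero if there is time to prefetch before the next one. */
  int observe_and_decide(double gap_ms)
  {
      predicted_gap = ALPHA * gap_ms + (1.0 - ALPHA) * predicted_gap;
      return predicted_gap > PREFETCH_COST;
  }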

tran:jadaptive:
Nancy Tran and Daniel A. Reed. Automatic ARIMA time series modeling for adaptive I/O prefetching. IEEE Transactions on Parallel and Distributed Systems, 15(4):362-377, April 2004.
See also earlier version tran:adaptive.

Abstract: Inadequate I/O performance remains a major challenge in using high-end computing systems effectively. To address this problem, the paper presents TsModeler, an automatic time series modeling and prediction framework for adaptive I/O prefetching that uses ARIMA time series models to predict the temporal patterns of I/O requests. These online pattern analysis techniques and cutoff indicators for autocorrelation patterns enable multistep online predictions suitable for multiblock prefetching. This work also combines time series predictions with spatial Markov model predictions to determine when, what, and how many blocks to prefetch. Experimental results show reductions in execution time compared to the standard Linux file system across various hardware configurations.

Keywords: pario-bib, access pattern, prefetching, modeling, time-series analysis

triantafillou:overlay:
Peter Triantafillou and Christos Faloutsos. Overlay striping and optimal parallel I/O for modern applications. Parallel Computing, 24(1):21-43, January 1998.

Abstract: Disk array systems are rapidly becoming the secondary-storage media of choice for many emerging applications with large storage and high bandwidth requirements. Striping data across the disks of a disk array introduces significant performance benefits mainly because the effective transfer rate of the secondary storage is increased by a factor equal to the stripe width. However, the choice of the optimal stripe width is an open problem: no general formal analysis has been reported and intuition alone fails to provide good guidelines. As a result one may find occasionally contradictory recommendations in the literature. With this work we first contribute an analytical calculation of the optimal stripe width. Second, we recognize that the optimal stripe width is sensitive to the multiprogramming level, which is not known a priori and fluctuates with time. Thus, calculations of the optimal stripe width are, by themselves only, of little practical use. For this reason we propose a novel striping technique, called overlay striping, which allows objects to be retrieved using a number of alternative stripe widths. We provide the detailed algorithms for our overlay striping method, study the associated storage overhead and performance improvements, and show that we can achieve near optimal performance for very wide ranges of the possible multiprogramming levels, while incurring small storage overheads.

Keywords: parallel I/O, striping, pario-bib

Comment: Part of a special issue.

tridgell:hidios:
Andrew Tridgell and David Walsh. The HiDIOS filesystem. In Proceedings of the Fourth International Parallel Computing Workshop, pages 53-63, London, England, September 1995.

Keywords: parallel file system, pario-bib

Comment: A description of their new parallel file system for the AP-1000. Conceptually, not much new here.

trieber:raid5:
Kent Treiber and Jai Menon. Simulation study of cached RAID5 designs. In Proceedings of the First Conference on High-Performance Computer Architecture, pages 186-197. IEEE Computer Society Press, January 1995.

Abstract: This paper considers the performance of cached RAID5 using simulations that are driven by database I/O traces collected at customer sites. This is in contrast to previous performance studies using analytical modelling or random-number simulations. We studied issues of cache size, disk buffering, cache replacement policies, cache allocation policies, destage policies and striping. Our results indicate that: read caching has considerable value; a small amount of cache should be used for writes; fast write logic can reduce disk utilization for writes by an order of magnitude; priority queueing should be supported at the disks; disk buffering prefetch should be used; for large caches, it pays to cache sequentially accessed blocks; RAID5 with cylinder striping is superior to parity striping.

Keywords: parallel I/O, RAID, disk array, pario-bib

tsujita:mpi-io:
Yuichi Tsujita. Effective nonblocking MPI-I/O in remote I/O operations using a multithreaded mechanism. Lecture Notes in Computer Science, 3358:34-43, November 2004.

Abstract: A flexible intermediate library named Stampi realizes seamless MPI operations on interconnected parallel computers. Dynamic process creation and MPI-I/O operations both inside a computer and among computers are available with it. MPI-I/O operations to a remote computer are realized by MPI-I/O processes of the Stampi library which are invoked on a remote computer using a vendor-supplied MPI-I/O library. If the vendor-supplied one is not available, a single MPI-I/O process is invoked on a remote computer, and it uses UNIX I/O functions instead of the vendor-supplied one. In nonblocking MPI-I/O functions with multiple user processes, the single MPI-I/O process carries out I/O operations required by the processes sequentially. This results in small overlap of computation by the user processes with I/O operations by the MPI-I/O process. Therefore performance of the nonblocking functions is poor with multiple user processes. To realize effective I/O operations, a Pthreads library has been implemented in the MPI-I/O mechanism, and multi-threaded I/O operations have been realized. The newly implemented MPI-I/O mechanism has been evaluated on inter-connected PC clusters, and higher overlap of the computation with the I/O operations has been achieved.

Keywords: stampi, MPI-I/O, dynamic process creation, multithreaded, overlap computation and I/O, pario-bib

Comment: also see tsujita:stampi*.
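
A minimal sketch of the overlap that the multithreaded mechanism buys (ours; Stampi's internals and the MPI plumbing are not shown, and the file name and size are hypothetical): a Pthreads worker carries out the read while the caller keeps computing.

  #include <pthread.h>
  #include <stdio.h>

  #define NBYTES (64 * 1024 * 1024)
  static char buf[NBYTES];

  static void *reader(void *arg)           /* I/O thread: fill buf */
  {
      FILE *f = fopen((const char *)arg, "rb");
      if (f) { fread(buf, 1, NBYTES, f); fclose(f); }
      return NULL;
  }

  int main(void)
  {
      pthread_t tid;
      pthread_create(&tid, NULL, reader, "remote_data.bin");

      /* ... computation proceeds here while the read is in flight ... */

      pthread_join(tid, NULL);             /* wait before using buf */
      return 0;
  }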

tsujita:stampi:
Yuichi Tsujita, Toshiyuki Imamura, Hiroshi Takemiya, and Nobuhiro Yamagishi. Stampi-I/O: A flexible parallel-I/O library for heterogeneous computing environment. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 2474 of Lecture Notes in Computer Science, pages 288-? Springer-Verlag, 2002.

Abstract: An MPI-2 based parallel-I/O library, Stampi-I/O, has been developed using flexible communication infrastructure. In Stampi-I/O almost all MPI-I/O functions have been implemented. We can execute these functions using both local and remote I/O operations with the same application program interface (API) based on MPI-2. In I/O operations using Stampi-I/O, users need not handle any differences in the communication mechanism of computers. We have evaluated performance for primitive functions in Stampi-I/O. Through this test, sufficient performance has been achieved and effectiveness of our flexible implementation has been confirmed.

Keywords: parallel I/O, multiprocessor file system, pario-bib

tsujita:stampi2:
Yuichi Tsujita. Implementation of an MPI-I/O mechanism using PVFS in remote I/O to a PC cluster. In Seventh International Conference on High Performance Computing and Grid in Asia Pacific Region, pages 136-139, Tokyo, Japan, July 2004. Kinki University, IEEE Computer Society Press.

Abstract: A flexible intermediate library named Stampi realizes seamless MPI operations on a heterogeneous computing environment. With the help of a flexible communication mechanism of this library, users can execute MPI functions without awareness of underlying communication mechanism. Although Stampi supports MPI-I/O among different platforms, UNIX I/O functions are used when a vendor-supplied MPI-I/O library is not available. To realize distributed I/O operations, a parallel virtual file system (PVFS) has been implemented in the MPI-I/O mechanism. Primitive MPI-I/O functions of Stampi have been evaluated and sufficient performance has been achieved.

Keywords: MPI-IO, PVFS, remote I/O, grid, pario-bib

Comment: also see tsujita:stampi.

uk:protein-folding:
B. Uk, M. Taufer, T. Stricker, G. Settanni, A. Cavalli, and A. Caflisch. Combining task- and data parallelism to speed up protein folding on a desktop grid platform. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 240-249, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: The steady increase of computing power at lower and lower cost enables molecular dynamics simulations to investigate the process of protein folding with an explicit treatment of water molecules. Such simulations are typically done with well known computational chemistry codes like CHARMM. Desktop grids such as the United Devices MetaProcessor are highly attractive platforms, since scavenging for unused machines on Intra- and Internet delivers compute power that is almost free. However, the predominant programming paradigm for current desktop grids is pure task parallelism and might not fit the needs for protein folding simulations with explicit water molecules. A short overall turn-around time of a simulation remains highly important for research productivity, but the need for an accurate model and long simulation time-scales leads to tasks that are too large for optimal scheduling on a desktop grid. To address this problem, we introduce a combination of task- and data parallelism as a well suitable computing paradigm for protein folding investigations on grid platforms. As a proof of concept, we design and implement a simple system for protein folding simulations based on the notion of combined task and data parallelism with clustered workers. Clustered workers are machines grouped into small clusters according to network and CPU performance criteria and act as super-nodes within a desktop grid, permitting the utilization of data parallelism in addition to the task parallelism. We integrate our new paradigm into the existing software environment of the United Devices MetaProcessor. For a test protein, we reach a better quality of the folding calculations than we reached using just task parallelism on distributed systems.

Keywords: protein folding, grid application, parallel I/O, pario-app, pario-bib

uysal:mems:
Mustafa Uysal, Arif Merchant, and Guillermo A. Alvarez. Using MEMS-based storage in disk arrays. In Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies, pages 89-101, San Francisco, CA, April 2003. USENIX Association.

Abstract: Current disk arrays, the basic building blocks of high-performance storage systems, are built around two memory technologies: magnetic disk drives, and non-volatile DRAM caches. Disk latencies are higher by six orders of magnitude than non-volatile DRAM access times, but cache costs over 1000 times more per byte. A new storage technology based on microelectromechanical systems (MEMS) will soon offer a new set of performance and cost characteristics that bridge the gap between disk drives and the caches. We evaluate potential gains in performance and cost by incorporating MEMS-based storage in disk arrays. Our evaluation is based on exploring potential placements of MEMS-based storage in a disk array. We used detailed disk array simulators to replay I/O traces of real applications for the evaluation. We show that replacing disks with MEMS-based storage can improve the array performance dramatically, with a cost performance ratio several times better than conventional arrays even if MEMS storage costs ten times as much as disk. We also demonstrate that hybrid MEMS/disk arrays, which cost less than purely MEMS-based arrays, can provide substantial improvements in performance and cost/performance over conventional arrays.

Keywords: mems-based storage, disk arrays, pario-bib

Comment: Best paper in fast2003.

vaitzblit:media:
Lev Vaitzblit. The design and implementation of a high-bandwidth file service for continuous media. Master's thesis, MIT, September 1991.

Keywords: multimedia, distributed file system, disk striping, pario-bib

Comment: A DFS for multimedia. Expect large files, read-mostly, highly sequential. Temporal synchronization is key. An administration server handles opens and closes, and provides guarantees on performance (like Swift). The interface at the client nodes talks to the admin server transparently, and stripes requests over all storage nodes. Storage nodes may internally use RAIDs, I suppose. Files are a series of frames, rather than bytes. Each frame has a time offset in seconds. Seeks can be by frame number or time offset. File containers contain several files, and have attributes that specify performance requirements. Interface does prefetching, based on read direction (forward or backward) and any frame skips. But frames are not transmitted from storage server to client node until requested (client pacing). Claim that synchronous disk interleaving with a striping unit of one frame is best. Could get 30 frames/sec (3.5MB/s) with 2 DECstation 5000s and 4 disks, serving a client DEC 5000.

vandegoor:unixio:
A. J. van de Goor and A. Moolenaar. UNIX I/O in a multiprocessor system. In Proceedings of the 1988 Winter USENIX Conference, pages 251-258, 1988.

Keywords: unix, multiprocessor file system, pario-bib

Comment: How to split up the internals of the Unix I/O system to run on a shared-memory multiprocessor in a non-symmetric OS. They decided to split the functionality just above the buffer cache level, putting the buffer cache management and device drivers on the special I/O processors.

vanderleest:contention:
Steven VanderLeest and Ravishankar K. Iyer. Measurement of I/O bus contention and correlation among heterogeneous device types in a single-bus multiprocessor system. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 36-53. Univ of Illinois, Urbana-Champaign, April 1994. Also appeared in Computer Architecture News 22(4).
See also later version vanderleest:contention-book.

Keywords: parallel I/O, pario-bib

Comment: Using a hardware monitor they measure the I/O-bus usage on a 4-processor Sun workstation. They characterize the bus contention caused by multiple different devices (disk, screen, and network). The contention sometimes caused significant performance degradation (to the end-user) despite the bus not being overloaded.

vanderleest:contention-book:
Steven H. VanderLeest and Ravishankar K. Iyer. Heterogeneous I/O contention in a single-bus multiprocessor. In Jain et al. [iopads-book], chapter 14, pages 313-331.
See also earlier version vanderleest:contention.

Abstract: None.

Keywords: parallel I/O, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

varki:issues:
E. Varki, A. Merchant, J. Z. Xu, and X. Z. Qiu. Issues and challenges in the performance analysis of real disk arrays. IEEE Transactions on Parallel and Distributed Systems, 15(6):559-574, June 2004.

Abstract: The performance modeling and analysis of disk arrays is challenging due to the presence of multiple disks, large array caches, and sophisticated array controllers. Moreover, storage manufacturers may not reveal the internal algorithms implemented in their devices, so real disk arrays are effectively black-boxes. We use standard performance techniques to develop an integrated performance model that incorporates some of the complexities of real disk arrays. We show how measurement data and baseline performance models can be used to extract information about the various features implemented in a disk array. In this process, we identify areas for future research in the performance analysis of real disk arrays.

Keywords: performance analysis, disk arrays, performance modeling, pario-bib

varma:bdestage:
Anujan Varma and Quinn Jacobson. Destage algorithms for disk arrays with non-volatile caches. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 10, pages 129-146. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version varma:destage.

Keywords: parallel I/O, disk array, RAID, disk caching, pario-bib

Comment: Part of jin:io-book; reformatted version of varma:destage.

varma:destage:
Anujan Varma and Quinn Jacobson. Destage algorithms for disk arrays with non-volatile caches. IEEE Transactions on Computers, 47(2), February 1998.
See also later version varma:bdestage.

Abstract: In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches, such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.

Keywords: parallel I/O, disk array, RAID, disk caching, pario-bib
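
The linear threshold policy described in the abstract reduces to a one-line rate function; a minimal sketch (the rate bounds are hypothetical):

  /* Destage rate rises linearly with write-cache occupancy: idle when
   * the cache is empty, maximum effort when it is full. */
  int destage_rate(int dirty_blocks, int cache_capacity)
  {
      const int min_rate = 0;    /* destages/s at empty cache */
      const int max_rate = 400;  /* destages/s at full cache  */
      return min_rate +
             (max_rate - min_rate) * dirty_blocks / cache_capacity;
  }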

varman:bounds:
Peter J. Varman and Rakesh M. Verma. Tight bounds for prefetching and buffer management algorithms for parallel I/O systems. IEEE Transactions on Parallel and Distributed Systems, 10(12):1262-1275, December 1999.

Abstract: The I/O performance of applications in multiple-disk systems can be improved by overlapping disk accesses. This requires the use of appropriate prefetching and buffer management algorithms that ensure the most useful blocks are accessed and retained in the buffer. In this paper, we answer several fundamental questions on prefetching and buffer management for distributed-buffer parallel I/O systems. First, we derive and prove the optimality of an algorithm, P-min, that minimizes the number of parallel I/Os. Second, we analyze P-con, an algorithm that always matches its replacement decisions with those of the well-known demand-paged MIN algorithm. We show that P-con can become fully sequential in the worst case. Third, we investigate the behavior of on-line algorithms for multiple-disk prefetching and buffer management. We define and analyze P-lru, a parallel version of the traditional LRU buffer management algorithm. Unexpectedly, we find that the competitive ratio of P-lru is independent of the number of disks. Finally, we present the practical performance of these algorithms on randomly generated reference strings. These results confirm the conclusions derived from the analysis on worst case inputs.

Keywords: parallel I/O, prefetching, pario-bib

vellanki:predict:
Vivekanand Vellanki and Ann Chervenak. A cost-benefit scheme for high performance predictive prefetching. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: file prefetching, cost-benefit analysis, parallel I/O, pario-bib

Comment: They describe a prefetching scheme which prefetches blocks using a cost-benefit analysis scheme based on the probability that the block will be accessed. The benefit of prefetching a block is compared to the cost of replacing another block from the cache. They were able to reduce cache miss rates by 36% for workloads which receive no benefit from sequential prefetching.
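
A minimal sketch of that cost-benefit test (the names and value estimates are hypothetical stand-ins, not the paper's code): prefetch a block only when its expected benefit exceeds the value of the cache block that would be evicted to make room.

  /* Expected stall avoided if we prefetch a block we may access. */
  double expected_benefit(double p_access, double miss_penalty)
  {
      return p_access * miss_penalty;
  }

  /* victim_value estimates the worth of the least valuable cached
   * block; prefetch only when the trade is favorable. */
  int should_prefetch(double p_access, double miss_penalty,
                      double victim_value)
  {
      return expected_benefit(p_access, miss_penalty) > victim_value;
  }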

vengroff:TPIE:
Darren Erik Vengroff. A transparent parallel I/O environment. In Proceedings of the 1994 DAGS/PC Symposium, pages 117-134, Hanover, NH, July 1994. Dartmouth Institute for Advanced Graduate Studies.

Keywords: parallel I/O, parallel I/O algorithms, pario-bib

Comment: Interesting interface, providing high-level data-parallel access to vectors of data on disk. Implementation expectation is to use raw disk devices. Goals: abstraction, support for algorithmic optimality, flexible, portable, and extensible. TPIE is a set of C++ templates and libraries, where the user supplies callback functions to TPIE access methods. TPIE contains a small variety of access methods, each of which operates on a set of input and output streams, calling the user's function once for each set of input records. They can do scan, merge, distribution, sort, permute, batch filter, and distribution-sweep. There is a single thread of control (at least conceptually). Their first prototype is on a Sun SPARCstation; later, clusters of workstations and then a multiprocessor. See vengroff:efficient, vengroff:tpie-man.

vengroff:efficient:
Darren Erik Vengroff and Jeffrey Scott Vitter. Supporting I/O-efficient scientific computation in TPIE. In Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing, pages 74-77, San Antonio, TX, October 1995. IEEE Computer Society Press.
See also earlier version vengroff:efficient-tr.
See also later version vengroff:efficient2.

Keywords: parallel I/O, algorithm, run-time library, pario-bib

Comment: Shorter version of vengroff:efficient2. Excellent paper. This paper does not describe TPIE itself very much, but more about a set of benchmarks using TPIE. All of the benchmarks are run on one disk and one processor. TPIE can use multiple disks and one processor, with plans to extend it to multiple processors later. See vengroff:tpie and vengroff:efficient-tr. Same as vengroff:efficient2?

vengroff:efficient-tr:
Darren Erik Vengroff and Jeffrey Scott Vitter. I/O-efficient scientific computation using TPIE. Technical Report CS-1995-18, Dept. of Computer Science, Duke University, July 1995.
See also later version vengroff:efficient.

Keywords: parallel I/O algorithm, scientific computing, runtime library, pario-bib

Comment: Expanded version of vengroff:efficient.

vengroff:efficient2:
Darren Erik Vengroff and Jeffrey Scott Vitter. I/O-efficient scientific computation using TPIE. In Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies, pages II:553-570, September 1996.
See also earlier version vengroff:efficient.

Keywords: parallel I/O algorithms, run-time support, parallel I/O, multiprocessor file system interface, pario-bib

Comment: Longer version of vengroff:efficient.

vengroff:thesis:
Darren Erik Vengroff. The Theory and Practice of I/O-Efficient Computation. PhD thesis, Department of Computer Science, Brown University, Providence, RI, April 1996.

Keywords: parallel I/O algorithm, pario-bib

vengroff:tpie-man:
Darren Erik Vengroff. TPIE user manual and reference. Available on the WWW at http://www.cs.duke.edu/~dev/tpie_home_page.html, January 1995. Alpha release.

Keywords: parallel I/O, parallel I/O algorithm, file system interface, pario-bib

Comment: Currently an alpha version. It is in the process of being updated. The most current version is generally available on the WWW. See vengroff:tpie, vengroff:efficient.

venugopal:delays:
C. R. Venugopal and S. S. S. P. Rao. Impact of delays in parallel I/O system: An empirical study. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 490-499. IEEE Computer Society Press, August 1996.

Abstract: Performance of I/O intensive applications on a multiprocessor system depends mostly on the variety of disk access delays encountered in the I/O system. Over the years, the improvement in disk performance has taken place more slowly than the corresponding increase in processor speeds. It is therefore necessary to model I/O delays and evaluate performance benefits of moving an application to a better multiprocessor system. We perform such an analysis by measuring I/O delays for a synthesized application that uses a parallel distributed file system. The aim of this study is to evaluate the performance benefits of better disks in a multiprocessor system. We report on how the I/O performance would be affected if an application were to run on a system which would have better disks and communication links. In this study, we show a substantial improvement in the performance of an I/O system with better disks and communication links with respect to the existing system.

Keywords: parallel I/O, pario-bib

vetsch:visiblehuman:
S. Vetsch, V. Messerli, O. Figueiredo, B. Gennart, R.D. Hersch, L. Bovisi, R. Welz, L. Bidaut, and O. Ratib. Visible human slice server. http://visiblehuman.epfl.ch/, 1998. A web site giving access to 2D views of a 3D scan of a human body.

Abstract: The computer scientists of EPFL (Prof. R.D. Hersch and his staff), in collaboration with the Geneva Hospitals and WDS Technologies SA, have developed a parallel image server to extract image slices of the Visible Human from any orientation. This 3D dataset originates from a prisoner sentenced to death who offered his body to science. The dead body was frozen and then cut and digitized into 1 mm horizontally spaced slices by the National Library of Medicine, Bethesda-Maryland and the University of Colorado, USA. The total volume of all slices represents a size of 13 Gbyte of data.

Keywords: image processing, parallel I/O, pario-bib

Comment: Very cool. See also gennart:CAP, messerli:tomographic, messerli:jimage, messerli:thesis.

vilayannur:caching:
Murali Vilayannur, Anand Sivasubramaniam, Mahmut Kandemir, Rajeev Thakur, and Robert Ross. Discretionary caching for I/O on clusters. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 96-103, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: I/O bottlenecks are already a problem in many large-scale applications that manipulate huge datasets. This problem is expected to get worse as applications get larger, and the I/O subsystem performance lags behind processor and memory speed improvements. Caching I/O blocks is one effective way of alleviating disk latencies, and there can be multiple levels of caching on a cluster of workstations. Previous studies have shown the benefits of caching - whether it be local to a particular node, or a shared global cache across the cluster - for certain applications. However, we show that while caching is useful in some situations, it can hurt performance if we are not careful about what to cache and when to bypass the cache. This paper presents compilation techniques and runtime support to address this problem. These techniques are implemented and evaluated on an experimental Linux/Pentium cluster running a parallel file system. Our results using a diverse set of applications (scientific and commercial) demonstrate the benefits of a discretionary approach to caching for I/O subsystems on clusters, providing as much as 33% savings over indiscriminately caching everything in some applications.

Keywords: caching, parallel I/O, pario-bib

vilayannur:posix-pvfs:
Murali Vilayannur, Robert B. Ross, Philip H. Carns, Rajeev Thakur, and Anand Sivasubramaniam. On the performance of the POSIX I/O interface to PVFS. In 12th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'04), pages 332-339, Coruna, Spain, February 2004. IEEE Computer Society Press.

Abstract: The ever-increasing gap in performance between CPU/memory technologies and the I/O subsystem (disks, I/O buses) in modern workstations has exacerbated the I/O bottlenecks inherent in applications that access large disk resident data sets. A common technique to alleviate the I/O bottlenecks on clusters of workstations, is the use of parallel file systems. One such parallel file system is the Parallel Virtual File System (PVFS), which is a freely available tool to achieve high-performance I/O on Linux-based clusters. In this paper we describe the performance and scalability of the UNIX I/O interface to PVFS. To illustrate the performance, we present experimental results using Bonnie++, a commonly used file system benchmark to test file system throughput; a synthetic parallel I/O application for calculating aggregate read and write bandwidths; and a synthetic benchmark which calculates the time taken to untar the Linux kernel source tree to measure performance of a large number of small file operations. We obtained aggregate read and write bandwidths as high as 550 MB/s with a Myrinet-based network and 160 MB/s with fast Ethernet.

Keywords: posix I/O interface, performance, PVFS, parallel file system, pario-bib

vitter:jprefetch:
Jeffrey Scott Vitter and P. Krishnan. Optimal prefetching via data compression. In Foundations of Computer Science, pages 121-130, 1991.
See also earlier version vitter:prefetch.

Keywords: prefetching, data compression, pario-bib

vitter:optimal:
Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Optimal disk I/O with parallel block transfer. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC '90), pages 159-169, May 1990.

Keywords: parallel I/O algorithms, parallel memory, pario-bib

Comment: Summary of vitter:parmem1 and vitter:parmem2.

vitter:parmem1:
Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2/3):110-147, August and September 1994.
See also earlier version vitter:parmem1-tr.

Abstract: We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the $P$ secondary storage devices can simultaneously transfer a contiguous block of $B$ records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with $P$ disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into $P$ separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than $\ell$ times the optimal number of I/Os is exponentially small in $\ell (\log \ell) \log (M/B)$, where $M$ is the internal memory size.

Keywords: parallel I/O algorithms, parallel memory, pario-bib

Comment: See shorter version vitter:optimal. See TR version vitter:parmem1-tr. See also vitter:parmem2.

vitter:parmem1-tr:
Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Technical Report CS-93-01, Dept. of Computer Science, Duke University, January 1993. A summary appears in STOC '90. Revised version of Brown CS-92-04. Appeared in Algorithmica August 1994.
See also later version vitter:parmem1.

Keywords: parallel I/O algorithms, parallel memory, pario-bib

Comment: Summarized in vitter:optimal. Published as vitter:parmem1.

vitter:parmem2:
Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica, 12(2/3):148-169, August and September 1994.
See also earlier version vitter:parmem2-tr.

Abstract: In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there are $P$ memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using $\ell$ times the optimal running time is exponentially small in $\ell (\log \ell) \log P$.

Keywords: parallel I/O algorithms, parallel memory, pario-bib

Comment: Summarized in vitter:optimal.

vitter:parmem2-tr:
Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Technical Report CS-93-02, Dept. of Computer Science, Duke University, January 1993. A summary appears in STOC '90. Revised version of Brown CS-92-05. Appeared in Algorithmica 12(2,3).
See also later version vitter:parmem2.

Keywords: parallel I/O algorithms, parallel memory, pario-bib

Comment: Summarized in vitter:optimal.

vitter:prefetch:
Jeffrey Scott Vitter and P. Krishnan. Optimal prefetching via data compression. Technical Report CS-91-46, Brown University, July 1991. A summary appears in FOCS '91.
See also later version vitter:jprefetch.

Abstract: Caching and prefetching are important mechanisms for speeding up access time to data on secondary storage. Recent work in competitive online algorithms has uncovered several promising new algorithms for caching. In this paper, we apply a form of the competitive philosophy for the first time to the problem of prefetching to develop an optimal universal prefetcher in terms of fault ratio, with particular applications to large-scale databases and hypertext systems. Our algorithms for prefetching are novel in that they are based on data compression techniques that are both theoretically optimal and good in practice. Intuitively, in order to compress data effectively, you have to be able to predict future data well, and thus good data compressors should be able to predict well for purposes of prefetching. We show for powerful models such as Markov sources and $m$th order Markov sources that the page fault rates incurred by our prefetching algorithms are optimal in the limit for almost all sequences of page accesses.
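
To make the compression-prediction connection concrete, here is a toy first-order Markov prefetcher in Python (an illustrative sketch of the general idea only, not the authors' optimal algorithm; all names are hypothetical): the same successor statistics a compressor would use to code the access stream pick the page to prefetch.

    from collections import Counter, defaultdict

    class MarkovPrefetcher:
        """Prefetch the page that most often followed the current page."""

        def __init__(self):
            self.successors = defaultdict(Counter)  # page -> successor counts
            self.prev = None

        def access(self, page):
            if self.prev is not None:
                self.successors[self.prev][page] += 1  # learn the transition
            self.prev = page

        def predict(self):
            counts = self.successors[self.prev]
            return counts.most_common(1)[0][0] if counts else None

    p = MarkovPrefetcher()
    for page in [1, 2, 3, 1, 2, 3, 1]:
        p.access(page)
    print(p.predict())  # -> 2, the usual successor of page 1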

Keywords: parallel I/O algorithms, disk prefetching, pario-bib

Comment: ``This... is on prefetching, but I think the ideas will have a lot of use with parallel disks. The implementations we have now are doing amazingly well compared to LRU.'' [Vitter]. See vitter:jprefetch.

vitter:summary:
Jeffrey Scott Vitter. Efficient memory access in large-scale computation. In Proceedings of the 1991 Symposium on Theoretical Aspects of Computer Science (STACS '91), volume 480 of Lecture Notes in Computer Science, pages 26-41, Berlin, 1991. Springer-Verlag.

Keywords: parallel I/O algorithms, sorting, pario-bib

Comment: Good overview of all the other papers.

vitter:survey:
Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive data. In Abello and Vitter [abello:dimacs], pages 1-38.

Keywords: out-of-core algorithm, pario-bib

Comment: Earlier shorter versions entitled "External Memory Algorithms" appear as an invited tutorial in Proceedings of the 17th ACM Symposium on Principles of Database Systems, Seattle, WA, June 1998, 119-128, and as an invited paper in Proceedings of the 6th Annual European Symposium on Algorithms, Venice, Italy, August 1998, 1-25, published in Lecture Notes in Computer Science, 1461, Springer-Verlag, Berlin.

vitter:uniform:
Jeffrey Scott Vitter and Mark H. Nodine. Large-scale sorting in uniform memory hierarchies. Journal of Parallel and Distributed Computing, 17(1-2):107-114, January and February 1993.

Keywords: parallel I/O algorithm, sorting, pario-bib

Comment: Summary is nodine:sort.

vms:stripe:
Digital Equipment Corporation. VAX Disk Striping Driver for VMS, December 1989. Order Number AA-NY13A-TE.

Keywords: disk striping, pario-bib

Comment: Describes the VAX disk striping driver. Stripes an apparently arbitrary number of disk devices. All devices must be the same type, and apparently completely used. Manager can specify ``chunksize'', the number of logical blocks per striped block. They suggest using the track size of the device as the chunk size. They also point out that multiple controllers should be used in order to gain parallelism.
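
The chunked striping described above reduces to simple address arithmetic. A minimal sketch (hypothetical function name), assuming identical devices and round-robin chunk placement:

    def map_logical_block(lbn, chunk_size, num_disks):
        """Map a logical block number to (disk, physical block) under
        round-robin chunk striping; chunk_size is in logical blocks."""
        chunk = lbn // chunk_size        # which striped chunk
        offset = lbn % chunk_size        # block within that chunk
        disk = chunk % num_disks         # round-robin across devices
        row = chunk // num_disks         # chunk's position on its disk
        return disk, row * chunk_size + offset

    # With a track-sized chunk of 64 blocks on 4 disks:
    print(map_logical_block(200, 64, 4))  # -> (3, 8)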

voelker:coop:
Geoffrey M. Voelker, Eric J. Anderson, Tracy Kimbrel, Michael J. Feeley, Jeffrey S. Chase, Anna R. Karlin, and Henry M. Levy. Implementing cooperative prefetching and caching in a globally-managed memory system. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 33-43. ACM Press, June 1998.

Keywords: distributed shared memory, cooperative caching, parallel I/O, pario-bib

vydyanathan:pipeline:
Naga Vydyanathan, Gaurav Khanna, Tahsin M. Kurc, Umit V. Catalyurek, Pete Wyckoff, Joel H. Saltz, and P. Sadayappan. Use of PVFS for efficient execution of jobs with pipeline-shared I/O. In R. Buyya, editor, Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, pages 235-242, Pittsburgh, PA, November 2004. IEEE Computer Society Press.

Abstract: This paper is concerned with efficient execution of applications that are composed of a chain of sequential data processes, which exchange data through a file system. We focus on pipeline-shared I/O behavior within a single pipeline of processes running on a cluster. We examine several scheduling strategies and experimentally evaluate them for efficient use of the Parallel Virtual File System (PVFS) as a common storage pool.

Keywords: PVFS, pipelined-shared I/O, grid computing, pario-bib

waltz:database:
David L. Waltz. Innovative massively parallel AI applications. In Proceedings of the 1993 DAGS/PC Symposium, pages 132-138, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.

Abstract: Massively parallel applications must address problems that will be too large for workstations for the next several years, or else it will not make sense to expend development costs on them. Suitable applications have one or more of the following properties: 1) large amounts of data; 2) intensive computations; 3) a requirement for very fast response times; 4) ways to trade computations for human effort, as in developing applications using learning methods. Most of the suitable applications that we have found come from the general area of very large databases. Massively parallel machines have proved to be important not only in being able to run large applications, but in accelerating development (allowing the use of simpler algorithms, cutting the time to test performance on realistic databases) and allowing many different algorithms and parameter settings to be tried and compared for a particular task. This paper summarizes four such applications.

The applications described are: 1) prediction of credit card "defaulters" (non-payers) and "attritters" (people who didn't renew their cards) from a credit card database; 2) prediction of the continuation of time series, e.g. stock price movements; 3) automatic keyword assignment for news articles; and 4) protein secondary structure prediction. These add to a list identified in an earlier paper [Waltz 90] including: 5) automatic classification of U.S. Census Bureau long forms, using MBR - Memory-Based Reasoning [Creecy et al 92, Waltz 89, Stanfill & Waltz 86]; 6) generating catalogs for a mail order company that maximize expected net returns (revenues from orders minus cost of the catalogs and mailings) using genetically-inspired methods; and 7) text-based intelligent systems for information retrieval, decision support, etc.

Keywords: database, AI, artificial intelligence, pario-bib

Comment: Invited speaker.

wang:paging:
Kuei Yu Wang and Dan C. Marinescu. An analysis of the paging activity of parallel programs, part I: Correlation of the paging activity of individual node programs in the SPMD execution mode. Technical Report CSD-TR-94-042, Purdue University, June 1994.

Keywords: parallel I/O, virtual memory, paging, characterization, pario-bib

Comment: They measured the paging behavior of programs running on a Paragon, and analyze the results. To do so, they sample the OSF paging statistics periodically. The general conclusions: they found a surprising amount of dissimilarity in the paging behaviors of nodes within the same program, both in terms of the amount of paging and the timing of peak paging activity. These characteristics do not bode well for systems that use gang scheduling, or applications that have a lot of barriers.

wang:workload:
Feng Wang, Qin Xin, Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Tyce T. McLarty. File system workload analysis for large scale scientific computing applications. In Proceedings of the Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, April 2004. IEEE Computer Society Press.

Abstract: Parallel scientific applications require high-performance I/O support from underlying file systems. A comprehensive understanding of the expected workload is therefore essential for the design of high-performance parallel file systems. We re-examine the workload characteristics in parallel computing environments in the light of recent technology advances and new applications.

We analyze application traces from a cluster with hundreds of nodes. On average, each application has only one or two typical request sizes. Large requests from several hundred kilobytes to several megabytes are very common. Although in some applications, small requests account for more than 90% of all requests, almost all of the I/O data are transferred by large requests. All of these applications show bursty access patterns. More than 65% of write requests have inter-arrival times within one millisecond in most applications. By running the same benchmark on different file models, we also find that the write throughput of using an individual output file for each node exceeds that of using a shared file for all nodes by a factor of 5. This indicates that current file systems are not well optimized for file sharing.

Keywords: file system workload, workload characterization, ASCI, lustre, scientific applications, pario-app, pario-bib

Comment: An I/O workload study of three applications on a 960-node (dual-processor) cluster at LLNL running the lustre-light parallel file system. The applications include an I/O benchmarking code (ior2) and two physics simulations: one that ran on 343 processors and one that ran on 1620 processors.

watson:hpss:
Richard W. Watson and Robert A. Coyne. The parallel I/O architecture of the high-performance storage system (HPSS). In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 27-44. IEEE Computer Society Press, September 1995.

Abstract: Datasets up to terabyte size and petabyte total capacities have created a serious imbalance between I/O and storage-system performance and system functionality. One promising approach is the use of parallel data-transfer techniques for client access to storage, peripheral-to-peripheral transfers, and remote file transfers. This paper describes the parallel I/O architecture and mechanisms, parallel transport protocol (PTP), parallel FTP, and parallel client application programming interface (API) used by the high-performance storage system (HPSS). Parallel storage integration issues with a local parallel file system are also discussed.

Keywords: mass storage, parallel I/O, multiprocessor file system interface, pario-bib

weissman:smart:
Jon B. Weissman. Smart file objects: A remote file access paradigm. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 89-97, Atlanta, GA, May 1999. ACM Press.

Abstract: This paper describes a new scheme for remote file access called Smart File Objects (SFO). The SFO is an object-oriented application-specific file access paradigm designed to attack the bottleneck imposed by high latency, low bandwidth networks such as wide-area and wireless networks. The SFO uses application and network information to adaptively prefetch needed data in parallel with the execution of the application. The SFO can offer additional advantages such as non-blocking I/O, bulk I/O, improved file access APIs, and increased reliability. We describe the SFO concept, a prototype implementation in the Mentat system, and the results obtained with a distributed gene sequence application running across the Internet and vBNS. The results show the potential of the SFO approach to improve application performance.

Keywords: object, parallel I/O, pario-bib

wickremesinghe:active-storage:
Rajiv Wickremesinghe, Jeffrey S. Chase, and Jeffrey S. Vitter. Distributed computing with load-managed active storage. In Proceedings of the Eleventh IEEE International Symposium on High Performance Distributed Computing, pages 24-34, Edinburgh, Scotland, 2002. IEEE Computer Society Press.

Keywords: I/O, active storage, TPIE, grid, parallel I/O, pario-bib

Comment: Very interesting talk... an extension of the TPIE work. They assign a mapping of computations to storage-based processors. This stuff is very similar to Armada. They place "functors" that have bounded per-record processing and bounded internal state at the ASU (active storage unit). Since functors have bounded computation and state, their behavior is predictable (used for load balancing and scheduling). The extensions to TPIE include data aggregation primitives for sets (unordered data), streams (sequential data), and arrays (random access data). They also allow functors to process "packets" (groups of records) useful for applications like a merge sort. The example applications include the standard TPIE GIS app, along with a merge sort.
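
To make the "bounded state, bounded per-record work" constraint concrete, a minimal sketch of a functor-style stream operator (hypothetical names; not the paper's API):

    class Functor:
        """A stream operator whose per-record work and internal state are
        bounded, so its cost is predictable for placement decisions."""

        def process(self, record):
            raise NotImplementedError

    class RecordCounter(Functor):
        """Bounded internal state: a single integer."""

        def __init__(self):
            self.count = 0

        def process(self, record):
            self.count += 1
            return record  # pass the record downstream unchanged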

wiebalck:enbd:
Arne Wiebalck, Peter T. Breuer, Volker Lindenstruth, and Timm M. Steinbeck. Fault-tolerant distributed mass storage for LHC computing. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 266-275, Tokyo, Japan, May 2003. IEEE Computer Society Press.

Abstract: In this paper we present the concept and first prototyping results of a modular fault-tolerant distributed mass storage architecture for large Linux PC clusters as they are deployed by the upcoming particle physics experiments. The device masquerading technique using an Enhanced Network Block Device (ENBD) enables local RAID over remote disks as the key concept of the ClusterRAID system. The block level interface to remote files, partitions or disks provided by the ENBD makes it possible to use the standard Linux software RAID to add fault-tolerance to the system. Preliminary performance measurements indicate that the latency is comparable to a local hard drive. With four disks throughput rates of up to 55MB/s were achieved with first prototypes for a RAID0 setup, and about 40MB/s for a RAID5 setup.

Keywords: RAID, fault-tolerance, high-energy physics, parallel I/O, pario-app, pario-bib

wilkes:autoraid:
John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. The HP AutoRAID hierarchical storage system. ACM Transactions on Computer Systems, 14(1):108-136, February 1996.
See also earlier version wilkes:autoraid-sosp.
See also later version wilkes:bautoraid.

Keywords: RAID, disk array, parallel I/O, pario-bib

wilkes:autoraid-sosp:
John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 96-108, Copper Mountain, CO, December 1995. ACM Press.
See also later version wilkes:autoraid.

Keywords: RAID, disk array, parallel I/O, pario-bib

Comment: Cite wilkes:autoraid. A commercial RAID box that transparently manages a hierarchy of two RAID systems, a RAID-1 mirrored system and a RAID-5 system. The goal is easy-to-use high performance, and they appear to have achieved that goal. Data in current use are kept in the RAID-1, and other data in RAID-5. This design gives performance of RAID-1 with cost of RAID-5. They have a clever scheme for spreading both RAIDs across most disks, including a hot spare. Dual controllers, power supplies, fans, etc. The design is a fairly standard RAID hardware controller, using standard SCSI disks, but with all the new tricks done in controller software. The paper gives a few results from the prototype hardware, and a lot of simulation results.
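
A minimal sketch of the promote/demote idea (hypothetical names and policy details; not HP's controller logic): recently written blocks live in the mirrored tier, and least-recently-used blocks are demoted to RAID-5 when the mirror fills.

    from collections import OrderedDict

    def on_write(block, mirror, raid5, mirror_capacity):
        """mirror: OrderedDict used as an LRU set; raid5: plain set."""
        raid5.discard(block)              # promote out of the RAID-5 tier
        if block in mirror:
            mirror.move_to_end(block)     # mark most recently used
        else:
            mirror[block] = None
        while len(mirror) > mirror_capacity:
            cold, _ = mirror.popitem(last=False)  # evict the LRU block
            raid5.add(cold)               # demote it to the RAID-5 tier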

wilkes:bautoraid:
John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. The HP AutoRAID hierarchical storage system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 7, pages 90-106. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version wilkes:autoraid.

Keywords: RAID, disk array, parallel I/O, pario-bib

Comment: Part of jin:io-book; reformatted version of wilkes:autoraid.

wilkes:datamesh:
John Wilkes. DataMesh - scope and objectives: a commentary. Technical Report HP-DSD-89-44, Hewlett-Packard, July 1989.
See also later version wilkes:datamesh1.

Keywords: parallel I/O, distributed file system, disk caching, heterogeneous file system, pario-bib

Comment: Hooks a heterogeneous set of storage devices together over a fast interconnect, each with its own identical processor. The whole would then act as a file server for a network. Data storage devices would range from fast to slow (e.g. optical jukebox), varying availability, etc. Many ideas here but few concrete suggestions. Very little mention of algorithms they might use to control the thing. See also wilkes:datamesh1, cao:tickertaip, chao:datamesh, wilkes:houses, wilkes:lessons.

wilkes:datamesh1:
John Wilkes. DataMesh research project, phase 1. In Proceedings of the USENIX File Systems Workshop, pages 63-69, May 1992.
See also earlier version wilkes:datamesh.

Keywords: distributed file system, parallel I/O, disk scheduling, disk layout, pario-bib

Comment: See chao:datamesh

wilkes:datamesh2:
John Wilkes. The DataMesh research project. In P. Welch et al., editor, Transputing '91, pages 547-533. IOS Press, 1991.

Keywords: parallel I/O, RAID, disk striping, pario-bib

Comment: An overview report on the DataMesh project. It adds a little to the earlier reports such as wilkes:datamesh1. It has some performance results from simulation.

wilkes:houses:
John Wilkes. DataMesh, house-building, and distributed systems technology. ACM Operating Systems Review, 27(2):104-108, April 1993.
See also earlier version wilkes:datamesh1.
See also later version wilkes:lessons.

Keywords: file system, distributed computing, pario-bib

Comment: Same as wilkes:lessons. See that for comments.

wilkes:lessons:
John Wilkes. DataMesh, house-building, and distributed systems technology. In Proceedings of the 1993 DAGS/PC Symposium, pages 1-5, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.
See also earlier version wilkes:houses.

Keywords: file system, parallel I/O, RAID, disk array, pario-bib

Comment: Invited speaker. Also appeared in ACM OSR April 1993 (wilkes:houses). This gives his viewpoint that we should be focusing more on architecture than on components, to design frameworks rather than just individual policies and mechanisms. It also gives a quick overview of DataMesh. For more DataMesh info, though, see cao:tickertaip, chao:datamesh, wilkes:datamesh1, wilkes:datamesh, wilkes:houses.

willeman:pario:
Ray Willeman, Susan Phillips, and Ron Fargason. An integrated library for parallel processing: The input/output component. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 573-575, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: parallel I/O, pario-bib

Comment: Like the CUBIX interface, in some ways. Meant for parallel access to a non-striped (sequential) file. Self-describing format so that the reader can read the formatting information and distribute data accordingly.

wisniewski:in-place:
Leonard F. Wisniewski. Structured permuting in place on parallel disk systems. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 128-139, Philadelphia, May 1996. ACM Press.

Abstract: The ability to perform permutations of large data sets in place reduces the amount of necessary available disk storage. The simplest way to perform a permutation often is to read the records of a data set from a source portion of data storage, permute them in memory, and write them to a separate target portion of the same size. It can be quite expensive, however, to provide disk storage that is twice the size of very large data sets. Permuting in place reduces the expense by using only a small amount of extra disk storage beyond the size of the data set.

This paper features in-place algorithms for commonly used structured permutations. We have developed an asymptotically optimal algorithm for performing BMMC (bit-matrix-multiply/complement) permutations in place that requires at most $\frac{2N}{BD}\left( 2\left\lceil\frac{\mathrm{rank}(\gamma)}{\lg (M/B)}\right\rceil + \frac{7}{2}\right)$ parallel disk accesses, as long as $M \geq 2BD$, where $N$ is the number of records in the data set, $M$ is the number of records that can fit in memory, $D$ is the number of disks, $B$ is the number of records in a block, and $\gamma$ is the lower-left $\lg (N/B) \times \lg B$ submatrix of the characteristic matrix for the permutation. This algorithm uses $N+M$ records of disk storage and requires only a constant factor more parallel disk accesses and insignificant additional computation than a previously published asymptotically optimal algorithm that uses $2N$ records of disk storage.

We also give algorithms to perform mesh and torus permutations on a $d$-dimensional mesh. The in-place algorithm for mesh permutations requires at most $3\lceil N/BD\rceil$ parallel I/Os and the in-place algorithm for torus permutations uses at most $4dN/BD$ parallel I/Os. The algorithms for mesh and torus permutations require no extra disk space as long as the memory size $M$ is at least $3BD$. The torus algorithm improves upon the previous best algorithm in terms of both time and space.

Keywords: parallel I/O, parallel I/O algorithm, permutation, out-of-core, pario-bib

wisniewski:mpiio:
Len Wisniewski, Brad Smisloff, and Nils Nieuwejaar. Sun MPI I/O: Efficient I/O for parallel applications. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.

Keywords: MPI I/O, parallel file system, pario-bib

Comment: They describe the port of MPI I/O to the Sun Parallel File system (a direct descendent of Galley).

wisniewski:sun-mpi-io:
Len Wisniewski, Brad Smisloff, and Nils Nieuwejaar. Sun MPI I/O: Efficient I/O for parallel applications, November 1999.

Keywords: MPI I/O, parallel I/O, multiprocessor file system interface, pario-bib

witkowski:hyper-fs:
Andrew Witkowski, Kumar Chandrakumar, and Greg Macchio. Concurrent I/O system for the hypercube multiprocessor. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, pages 1398-1407, Pasadena, CA, 1988. ACM Press.

Keywords: parallel I/O, hypercube, parallel file system, pario-bib

Comment: Concrete system for the hypercube. Files resident on one disk only. Little support for cooperation except for sequentialized access to parts of the file, or broadcast. No mention of random-access files. I/O nodes are distinguished from computation nodes. I/O nodes have separate comm. network. No parallel access. I/O hooked to front-end too.

wolf:dasd:
Joel L. Wolf, Philip S. Yu, and Hadas Shachnai. DASD dancing: A disk load balancing optimization scheme for video-on-demand computer systems. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 157-166, May 1995.

Keywords: parallel I/O, video server, multimedia, pario-bib

wolman:iobench:
Barry L. Wolman and Thomas M. Olson. IOBENCH: a system independent IO benchmark. Computer Architecture News, 17(5):55-70, September 1989.

Keywords: I/O benchmark, transaction processing, pario-bib

Comment: Not about parallel I/O, but see olson:random. Defines a new I/O benchmark that is fairly system-independent. Focus is for transaction processing systems. Cranks up many tasks (users) all doing repetitive read/writes for a specified time, using optional locking, and optional computation. Whole suite of results for comparison with others. See also chen:iobench.

womble:intro:
David E. Womble and David S. Greenberg. Parallel I/O: An introduction. Parallel Computing, 23(4):403-417, June 1997.

Keywords: parallel I/O, pario-bib

Comment: A brief introduction to the topic of parallel I/O (what, why, current research), followed by a roundtable discussion among the authors of the papers in womble:special-issue. The discussion focused on three questions: 1) What are the biggest gaps in current I/O services? 2) Why have vendors failed to adopt new file system technologies? 3) How much direct low-level control over I/O resources should be given to the users and why?

womble:outofcore:
David Womble, David Greenberg, Rolf Riesen, and Stephen Wheat. Out of core, out of mind: Practical parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, pages 10-16, Mississippi State University, October 1993.

Abstract: Parallel computers are becoming more powerful and more complex in response to the demand for computing power by scientists and engineers. Inevitably, new and more complex I/O systems will be developed for these systems. In particular we believe that the I/O system must provide the programmer with the ability to explicitly manage storage (despite the trend toward complex parallel file systems and caching schemes). One method of doing so is to have a partitioned secondary storage in which each processor owns a logical disk. Along with operating system enhancements which allow overheads such as buffer copying to be avoided and libraries to support optimal remapping of data, this sort of I/O system meets the needs of high performance computing.

Keywords: parallel I/O, parallel file system, pario-bib

Comment: They argue that it is important to allow the programmer to explicitly control their storage in some way. In particular, they advocate the Partitioned Secondary Storage (PSS) model, in which each processor has its own logical disk, rather than using a parallel file system (PFS) which automatically stripes a linear file across many disks. Basically, programmer knows best. Of course, libraries can help. They note that you will often need data in a different format than it comes, and may need it output in a different format; so, permutation algorithms are needed. Also important to be able to overlap computation with I/O. They use LU factorization as an example, and give an algorithm. On the nCUBE with the PUMA operating system, they get good performance. See womble:pario.

womble:pario:
David Womble, David Greenberg, Stephen Wheat, and Rolf Riesen. Beyond core: Making parallel computer I/O practical. In Proceedings of the 1993 DAGS/PC Symposium, pages 56-63, Hanover, NH, June 1993. Dartmouth Institute for Advanced Graduate Studies.

Abstract: The solution of Grand Challenge Problems will require computations which are too large to fit in the memories of even the largest machines. Inevitably, new designs of I/O systems will be necessary to support them. Through our implementations of an out-of-core LU factorization we have learned several important lessons about what I/O systems should be like. In particular we believe that the I/O system must provide the programmer with the ability to explicitly manage storage. One method of doing so is to have a partitioned secondary storage in which each processor owns a logical disk. Along with operating system enhancements which allow overheads such as buffer copying to be avoided, this sort of I/O system meets the needs of high performance computing.

Keywords: parallel I/O, out-of-core, parallel algorithm, scientific computing, multiprocessor file system, pario-bib

Comment: See womble:outofcore. See thakur:runtime, kotz:lu, brunet:factor for other out-of-core LU results.

womble:special-issue:
David E. Womble and David S. Greenberg. Parallel I/O. Parallel Computing, 23(4):iii, June 1997. Introduction to a special issue.

Keywords: parallel I/O, pario-bib

Comment: A one-page introduction to this special issue of Parallel Computing, which includes many papers about parallel I/O. See also womble:intro, nieuwejaar:jgalley, moore:ddio, barve:jmergesort, miller:jrama, schwabe:jlayouts, parsons:templates, cormen:early-vic, carretero:performance.

wong:benchmarks:
Parkson Wong and Rob F. Van der Wijngaart. NAS parallel benchmarks I/O version 2.4. Technical Report NAS-03-002, Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division, NASA Ames Research Center, Moffett Field, CA 94035-1000, January 2003.

Abstract: We describe a benchmark problem, based on the Block-Tridiagonal (BT) problem of the NAS Parallel Benchmarks (NPB), which is used to test the output capabilities of high-performance computing systems, especially parallel systems. We also present a source code implementation of the benchmark, called NPBIO2.4-MPI, based on the MPI implementation of the NPB, using a variety of ways to write the computed solutions to file.

Keywords: parallel I/O benchmarks, block tridiagonal, pario-app, pario-bib

woodward:scivi:
Paul R. Woodward. Interactive scientific visualization of fluid flow. IEEE Computer, 26(10):13-25, October 1993.

Keywords: parallel I/O architecture, scientific visualization, pario-bib

Comment: This paper is interesting for its impressive usage of RAIDs and parallel networks to support scientific visualization. In particular, the proposed Gigawall (a 10-foot by 6-foot gigapixel-per-second display) is run by 24 SGI processors and 32 9-disk RAIDs, connected to an MPP of some kind through an ATM switch. 512 GBytes of storage, playable at 450 MBytes per second, for 19 minutes of animation.

worringen:improving:
Joachim Worringen, Jesper Larsson Träff, and Hubert Ritzdorf. Improving generic non-contiguous file access for MPI-IO. Lecture Notes in Computer Science, 2840:309-318, October 2003.

Abstract: We present a fundamental improvement of the generic techniques for non-contiguous file access in MPI-IO. The improvement consists in the replacement of the conventional data management algorithms based on a representation of the non-contiguous fileview as a list of (offset, length) tuples. The improvement is termed listless i/o as it instead makes use of space- and time-efficient datatype handling functionality that is completely free of lists for processing non-contiguous data in the file or in memory. Listless i/o has been implemented for both independent and collective file accesses and improves access performance by increasing the data throughput between user buffers and file buffers. Additionally, it reduces the memory footprint of the process performing non-contiguous I/O. In this paper we give results for a synthetic benchmark on a PC cluster using different file systems. We demonstrate improvements in I/O bandwidth that exceed a factor of 10.

Keywords: access patterns, MPI-IO, listless I/O, pario-bib

Comment: Also see worringen:non-contiguous

worringen:non-contiguous:
Joachim Worringen, Jesper Larsson Träff, and Hubert Ritzdorf. Fast parallel non-contiguous file access. In Proceedings of SC2003: High Performance Networking and Computing, Phoenix, AZ, November 2003. IEEE Computer Society Press.

Abstract: Many applications of parallel I/O perform non-contiguous file accesses, but only few file system interfaces support non-contiguous access. In contrast, the most commonly used parallel programming interface, MPI, supports parallel I/O through its MPI-IO interface. Within this interface, non-contiguous accesses are supported by the use of derived MPI datatypes. Unfortunately, current MPI-IO implementations suffer from low performance of such non-contiguous accesses when compared to the performance of the storage system for contiguous accesses although a considerable amount of work has been done in this area. In this paper we analyze an important bottleneck in current implementations of MPI-IO, and present a new technique termed listless i/o to perform non-contiguous access with MPI-IO. On the NEC SX-series of parallel vector computers, listless i/o is able to increase the bandwidth for non-contiguous file access by sometimes more than a factor of 500 when compared to the traditional approach.

Keywords: parallel I/O interface, file access patterns, pario-bib

Comment: Published on the web.

worringen:sci-io:
Joachim Worringen. Efficient parallel I/O on SCI connected clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2000), pages 371-372, Chemnitz, Germany, December 2000. IEEE Computer Society Press, Los Alamitos, CA.

Abstract: This paper presents a new approach towards parallel I/O for message-passing (MPI) applications on clusters built with commodity hardware and an SCI interconnect: instead of using the classic scheme of clients and a number of servers communicating via TCP/IP, a pure peer-to-peer communication topology based on efficient use of the underlying SCI interconnect is presented. Every process of the MPI application is client as well as server for I/O operations. This allows for a maximum of locality in file access, while the accesses to remote portions of the distributed file are performed via distributed shared memory techniques. A server is only required to manage the initial distribution of the file fragments between the participating nodes and to provide services like external access and redundancy.

Keywords: parallel I/O, MPI-IO, SCI connected clusters, pario-bib

Comment: Short paper and a poster.

wu:noncontiguous:
Jiesheng Wu, Pete Wyckoff, and Dhabaleswar Panda. Supporting efficient noncontiguous access in PVFS over Infiniband. In Proceedings of the IEEE International Conference on Cluster Computing, pages 344-351, Hong Kong, China, December 2003. IEEE Computer Society Press.

Abstract: Noncontiguous I/O access is the main access pattern in many scientific applications. Noncontiguity exists both in access to files and in access to target memory regions on the client. This characteristic imposes a requirement of native noncontiguous I/O access support in cluster file systems for high performance. In this paper we address noncontiguous data transmission between the client and the I/O server in cluster file systems over a high performance network. We propose a novel approach, RDMA Gather/Scatter, to transfer noncontiguous data for such I/O accesses. We also propose a new scheme, Optimistic Group Registration, to reduce memory registration costs associated with this approach. We have designed and incorporated this approach in a version of PVFS over InfiniBand. Through a range of PVFS and MPI-IO micro-benchmarks, and the NAS BTIO benchmark, we demonstrate that our approach attains significant performance gains compared to other existing approaches.

Keywords: noncontiguous access patterns, PVFS, Infiniband, RDMA, pario-bib

wu:thrashing:
Kun-Lung Wu, Philip S. Yu, and James Z. Teng. Performance comparison of thrashing control policies for concurrent mergesorts with parallel prefetching. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 171-182, 1993.

Keywords: disk prefetching, parallel I/O, disk caching, sorting, pario-bib

Comment: They discuss prefetching and caching in database machines where mergesorts merge several input streams, each from its own disk, to one output stream, to its own disk. There are concurrent merges going on. A merge can cause thrashing when writes grab a clean buffer that holds an unused prefetch, thus forcing the block to later be read again. They consider several policies to handle this, but it seemed to me like they missed an obvious alternative, that may have been better: whenever you need a clean buffer to write into, but all the clean buffers hold unused-prefetched blocks, stall and wait while the dirty blocks are flushed (presumably started earlier when the clean-block count got too low). It seems better to wait for a half-finished write than to toss out a prefetched block and later have to read it again. Their simulations show that their techniques help a lot.
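
The stall-instead-of-steal policy suggested above might look roughly like this (a hypothetical sketch of this comment's suggestion, not an algorithm from the paper):

    class BufferPool:
        """Toy pool whose buffers are tagged 'clean', 'dirty', or
        'prefetched' (clean but holding an unconsumed prefetched block)."""

        def __init__(self, tags):
            self.tags = list(tags)

        def flush_one(self):
            # stand-in for stalling on an in-progress write-back
            self.tags[self.tags.index("dirty")] = "clean"

        def get_clean_buffer(self):
            if "clean" in self.tags:
                return self.tags.index("clean")
            if "dirty" in self.tags:
                self.flush_one()  # stall rather than steal a prefetched block
                return self.tags.index("clean")
            return self.tags.index("prefetched")  # last resort: re-read later

    pool = BufferPool(["prefetched", "dirty", "prefetched"])
    print(pool.get_clean_buffer())  # flushes the dirty buffer -> index 1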

yamamoto:astronomical:
Naotaka Yamamoto, Osamu Tatebe, and Satoshi Sekiguchi. Parallel and distributed astronomical data analysis on grid datafarm. In R. Buyya, editor, Proceedings of the 5th International Workshop on Grid Computing, pages 461-466, Pittsburgh, PA, November 2004. IEEE Computer Society Press.

Abstract: A comprehensive study of the whole petabyte-scale archival data of astronomical observatories has the potential to produce new science and new knowledge in the field, but it has not been feasible so far due to the lack of an adequate data analysis environment. The Grid Datafarm architecture is designed for global petabyte-scale data-intensive computing; it provides a Grid file system with file replica management for fault tolerance and load balancing, and parallel and distributed data computing support for a set of files, to meet the requirements of a comprehensive study of the whole archival data. In the paper, we discuss worldwide parallel and distributed data analysis in the observational astronomy field. The archival data is stored, replicated, and dispersed in a Gfarm file system. All the astronomical data analysis tools successfully access files in the Gfarm file system without any code modification, using a syscall hooking library, regardless of file replica locations. Performance evaluation of the parallel data analysis in several ways shows that file-affinity process scheduling plays an essential role for scalable and efficient parallel file I/O performance. A data calibration tool shows scalable file I/O performance, achieving 5.9 GB/sec and 4.0 GB/sec for reading and writing FITS files, respectively, using 30 cluster nodes (60 CPUs). On-demand file replica creation mitigates the overhead of access concentration. Another tool shows a performance improvement of a factor of six for reading a shared file by creating file replicas.

Keywords: grid, grid datafarm, astronomical data, pario-app, pario-bib

yang:construction:
Chao-Tung Yang, Chien-Tung Pan, Kuan-Ching Li, and Wen-Kui Chang. On construction of a large file system using PVFS for grid. Lecture Notes in Computer Science, 3320:860-863, December 2004.

Abstract: The Grid is the largest advance in networking since the Internet, as a Grid system provides capabilities that can be used widely and effectively. However, maintaining consistency and shared use of the data storage space of a Grid system is a challenge, making this problem important for both Computational Grids and Data Grids. Usability, expandability, high computational capability, and a large storage space can be established in a Grid by using cluster systems and parallel techniques. In this paper, we provide a Grid with high computational capability and larger storage to address this problem. In our Grid setting, we use cluster computing to increase computational performance, and PVFS2 to provide more effective storage for data. This supplies a suitable platform for Grid users, whether for large data access or heavy computation.

Keywords: grid I/O, PVFS2, cluster file system, pario-bib

yokota:nets:
Haruo Yokota. DR-Nets: Data-reconstruction networks for highly reliable parallel disk systems. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 105-116. Japan Advanced Institute of Science and Technology (JAIST), April 1994. Also appeared in Computer Architecture News 22(4).
See also later version yokota:nets-book.

Keywords: parallel I/O, pario-bib

Comment: They propose to link a set of disks with its own interconnect, e.g., a torus, to allow the disks to communicate to compute multi-dimensional parity and to respond to disk failures, without using the primary interconnect of the multiprocessor or distributed system. In this sense it is reminiscent of TickerTAIP or DataMesh.

yokota:nets-book:
Haruo Yokota and Yasuyuki Mimatsu. A scalable disk system with data reconstruction functions. In Jain et al. [iopads-book], chapter 16, pages 353-372.
See also earlier version yokota:nets.

Abstract: Scalable disk systems are required to implement well-balanced computer systems. We have proposed DR-nets, Data-Reconstruction networks, to construct scalable parallel disk systems with high reliability. Each node of a DR-net has disks, and is connected by links to form an interconnection network. To realize the high reliability, nodes in a sub-network of the interconnection network organize a group for parity calculation as proposed for RAIDs. Inter-node communication for calculating parity keeps the locality of data transfer, and it inhibits bottlenecks from occurring, even if the size of the network becomes very large. We have developed an experimental system using Transputers. In this chapter, we provide execution models for estimating the response time and throughput of DR-nets, and compare them to experimental results. We also discuss the reliability of the DR-nets and RAIDs.

Keywords: parallel I/O architecture, disk array, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

youssef:thesis:
Rachad Youssef. RAID for mobile computers. Master's thesis, Carnegie Mellon University Information Networking Institute, August 1995. Available as INI-TR 1995-3.

Keywords: parallel I/O, disk array, RAID, mobile computing, pario-bib

Comment: low-power, highly available disk arrays for mobile computers.

yu:modeling:
S. Yu, M. Winslett, J. Lee, and X. Ma. Automatic and portable performance modeling for parallel I/O: A machine-learning approach. ACM SIGMETRICS Performance Evaluation Review, 30(3):3-5, December 2002.

Abstract: A performance model for a parallel I/O system is essential for detailed performance analyses, automatic performance optimization of I/O request handling, and potential performance bottleneck identification. Yet how to build a portable performance model for a parallel I/O system is an open problem. In this paper, we present a machine-learning approach to automatic performance modeling for parallel I/O systems. Our approach is based on the use of a platform-independent performance metamodel, which is a radial basis function neural network. Given training data, the metamodel generates a performance model automatically and efficiently for a parallel I/O system on a given platform. Experiments suggest that our goal of having the generated model provide accurate performance predictions is attainable, for the parallel I/O library that served as our experimental testbed on an IBM SP. This suggests that it is possible to model parallel I/O system performance automatically and portably, and perhaps to model a broader class of storage systems as well.

Keywords: parallel I/O, performance model, pario-bib

yu:trading:
Xiang Yu, Benjamin Gum, Yuqun Chen, Randolph Y. Wang, Kai Li, Arvind Krishnamurthy, and Thomas E. Anderson. Trading capacity for performance in a disk array. In Proceedings of the 2000 Symposium on Operating Systems Design and Implementation, pages 243-258, San Diego, October 2000. USENIX Association.

Abstract: A variety of performance-enhancing techniques, such as striping, mirroring, and rotational data replication, exist in the disk array literature. Given a fixed budget of disks, one must intelligently choose what combination of these techniques to employ. In this paper, we present a way of designing disk arrays that can flexibly and systematically reduce seek and rotational delay in a balanced manner. We give analytical models that can guide an array designer towards optimal configurations by considering both disk and workload characteristics. We have implemented a prototype disk array that incorporates the configuration models. In the process, we have also developed a robust disk head position prediction mechanism without any hardware support. The resulting prototype demonstrates the effectiveness of the configuration models.

Keywords: disk array, file system, parallel I/O, pario-bib

zabback:reorg:
Peter Zabback, Ibrahim Onyuksel, Peter Scheuermann, and Gerhard Weikum. Database reorganization in parallel disk arrays with I/O service stealing. IEEE Transactions on Knowledge and Data Engineering, 10(5):855-858, September/October 1998.

Keywords: parallel I/O, disk array, database, disk reorganization, pario-bib

zajcew:osf1:
Roman Zajcew, Paul Roy, David Black, Chris Peak, Paulo Guedes, Bradford Kemp, John LoVerso, Michael Leibensperger, Michael Barnett, FaraMarz Rabii, and Durriya Netterwala. An OSF/1 UNIX for massively parallel multicomputers. In Proceedings of the 1993 Winter USENIX Technical Conference, pages 449-468, January 1993.

Keywords: unix, parallel operating system, multiprocessor file system, pario-bib

Comment: Describing the changes to OSF/1 to make OSF/1 AD TNC, primarily intended for NORMA MIMD multicomputers. Enhancements include a new file system, distributed implementation of sockets, and process management. The file system still has traditional file systems, each in its own partition, with a global name space built by mounting file systems on each other. The change is that mounts can be remote, i.e., managed by a different file server on another node. They plan to use prefix tables for pathname translation (welch:prefix, nelson:sprite). They use a token-based protocol to provide atomicity of read and write calls, and to maintain consistency of client-node caches. See also roy:unixfile. Process enhancements include a new SIGMIGRATE, rfork(), and rforkmulti().
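
Prefix-table pathname translation, mentioned above, amounts to a longest-matching-prefix lookup. A minimal sketch (hypothetical names; not the OSF/1 AD code):

    def resolve(path, prefix_table):
        """Route an absolute pathname to the server managing the file
        system mounted at the longest matching prefix."""
        best = ""
        for prefix in prefix_table:
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if len(prefix) > len(best):
                    best = prefix
        if not best:
            raise LookupError("no mount covers " + path)
        return prefix_table[best], path[len(best):] or "/"

    table = {"/": "node0", "/usr": "node3"}
    print(resolve("/usr/bin/ls", table))  # -> ('node3', '/bin/ls')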

zhang:n-spek:
Ming Zhang and Qing Yang. Performability evaluation of networked storage systems using N-SPEK. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 736-741, Tokyo, May 2003. IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: This paper introduces a new benchmark tool for evaluating performance and availability (performability) of networked storage systems, specifically storage area network (SAN) that is intended for providing block-level data storage with high performance and availability. The new benchmark tool, named N-SPEK (Networked-Storage Performability Evaluation Kernel module), consists of a controller, several workers, one or more probers, and several fault injection modules. N-SPEK is highly accurate and efficient since it runs at kernel level and eliminates skews and overheads caused by file systems. It allows a SAN architect to generate configurable storage workloads to the SAN under test and to inject different faults into various SAN components such as network devices, storage devices, and controllers. Available performances under different workloads and failure conditions are dynamically collected and recorded in the N-SPEK over a spectrum of time. To demonstrate its functionality, we apply N-SPEK to evaluate the performability of a specific iSCSI-based SAN under Linux environment. Our experiments show that N-SPEK not only efficiently generates quantitative performability results but also reveals a few optimization opportunities for future iSCSI implementations.

Keywords: benchmarking, performance, block-level access, pario-bib

zhou:greedy:
Xinrong Zhou and Tong Wei. A greedy I/O scheduling method in the storage system of clusters. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 712-717, Tokyo, May 2003. IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: As the size of a cluster becomes larger, the processing ability of the cluster increases rapidly. Users will exploit this increased power to run scientific, physical, and multimedia applications. These kinds of data-intensive applications require a high-performance storage subsystem. Parallel storage systems such as RAID are widely used in today's clusters. In this paper, we present a "greedy" I/O scheduling method that utilizes Scatter and Gather operations inside the PCI-SCSI adapter to combine as many I/O operations within the same disk as possible. In this way we reduce the number of I/O operations and improve the performance of the whole storage system. After analyzing the RAID control strategy, we find that combining I/O commands may also cause data movement in memory, and this kind of movement increases the system's overhead. The experimental results in our real-time operating environment show that better performance can be achieved. The longer the data length, the better the improvement; in some cases we can even get over a 40% enhancement.
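
The greedy combining the abstract describes can be illustrated by coalescing adjacent per-disk extents before issuing them (an illustrative sketch only, not the authors' implementation):

    def coalesce(requests):
        """Merge adjacent or overlapping (disk, offset, length) requests
        on the same disk into single scatter/gather operations."""
        merged = []
        for disk, off, length in sorted(requests):
            if merged and merged[-1][0] == disk and off <= merged[-1][1] + merged[-1][2]:
                d, o, l = merged[-1]
                merged[-1] = (d, o, max(l, off + length - o))  # extend the extent
            else:
                merged.append((disk, off, length))
        return merged

    reqs = [(0, 0, 8), (1, 16, 8), (0, 8, 8)]
    print(coalesce(reqs))  # -> [(0, 0, 16), (1, 16, 8)]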

Keywords: parallel I/O, disk scheduling, pario-bib

zhou:threads:
Yuanyuan Zhou, Limin Wang, Douglas W. Clark, and Kai Li. Thread scheduling for out-of-core applications with memory server on multicomputers. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 57-67, Atlanta, GA, May 1999. ACM Press.

Abstract: Out-of-core applications perform poorly in paged virtual memory (VM) systems because demand paging involves slow disk I/O accesses. Much research has been done on reducing the I/O overhead in such applications by either reducing the number of I/Os or lowering the cost of each I/O operation. In this paper, we investigate a method that combines fine-grained threading with a memory server model to improve the performance of out-of-core applications on multicomputers. The memory server model decreases the average cost of I/O operations by paging to remote memory, while the fine-grained thread scheduling reduces the number of I/O accesses by improving the data locality of applications. We have evaluated this method on an Intel Paragon with 7 applications. Our results show that the memory server system performs better than the VM disk paging by a factor of 5 for sequential applications and a factor of 1.5 to 2.2 for parallel applications. The fine-grained threading alone improves the VM disk paging performance by a factor of 10 and 1.2 to 3 respectively for sequential and parallel applications. Overall, the combination of these two techniques outperforms the VM disk paging by more than a factor of 12 for sequential applications and a factor of 3 to 6 for parallel applications.

Keywords: threads, scheduling, memory, out-of-core application, parallel I/O, pario-bib

zhu:case-study:
Y. F. Zhu, H. Jiang, X. Qin, and D. Swanson. A case study of parallel I/O for biological sequence search on Linux clusters. In Proceedings of the IEEE International Conference on Cluster Computing, pages 308-315, Hong Kong, China, December 2003. IEEE Computer Society Press.

Abstract: In this paper we analyze the I/O access patterns of a widely-used biological sequence search tool and implement two variations that employ parallel I/O for data access based on PVFS (Parallel Virtual File System) and CEFT-PVFS (Cost-Effective Fault-Tolerant PVFS). Experiments show that the two variations outperform the original tool when equal or even fewer storage devices are used in the former. It is also found that although the performance of the two variations improves consistently when initially increasing the number of servers, this performance gain from parallel I/O becomes insignificant with further increases in server number. We examine the effectiveness of two read-performance optimization techniques in CEFT-PVFS by using this tool as a benchmark. Performance results indicate: (1) doubling the degree of parallelism boosts the read performance to approach that of PVFS; (2) skipping hotspots can substantially improve the I/O performance when the load on the data servers is highly imbalanced. The I/O resource contention due to the sharing of server nodes by multiple applications in a cluster has been shown to degrade the performance of the original tool and the variation based on PVFS by factors of up to 10 and 21, respectively, whereas the variation based on CEFT-PVFS suffered only a two-fold performance degradation.

Keywords: BLAST, CEFT-PVFS, parallel I/O, PVFS, application, characterization, I/O access patterns, biology application, pario-app, pario-bib

zhu:ceft-pvfs:
Yifeng Zhu, Hong Jiang, Xiao Qin, Dan Feng, and David R. Swanson. Improved read performance in a cost-effective, fault-tolerant parallel virtual file system (CEFT-PVFS). In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 730-735, Tokyo, May 2003. IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: Due to the ever-widening performance gap between processors and disks, I/O operations tend to become the major performance bottleneck of data-intensive applications on modern clusters. If all the existing disks on the nodes of a cluster are connected together to establish high performance parallel storage systems, the cluster's overall performance can be boosted at no additional cost. CEFT-PVFS (a RAID 10 style parallel file system that extends the original PVFS), as one such system, divides the cluster nodes into two groups, stripes the data across one group in a round-robin fashion, and then duplicates the same data to the other group to provide storage service of high performance and high reliability. Previous research has shown that the system reliability is improved by a factor of more than 40 with mirroring while maintaining a comparable write performance. This paper presents another benefit of CEFT-PVFS in which the aggregate peak read performance can be improved by as much as 100% over that of the original PVFS by exploiting the increased parallelism.

Additionally, when the data servers, which typically are also computational nodes in a cluster environment, are loaded in an unbalanced way by applications running in the cluster, the read performance of PVFS will be degraded significantly. On the contrary, in the CEFT-PVFS, a heavily loaded data server can be skipped and all the desired data is read from its mirroring node. Thus the performance will not be affected unless both the server node and its mirroring node are heavily loaded.

Keywords: parallel I/O, fault-tolerance, read performance, parallel file system, PVFS, pario-bib