BibTeX bibliography file: Parallel I/O
Ongoing Edition
Last updated: 2005.Mar.02

This edition supersedes my older bibliographies. This bibliography is available on the WWW at http://www.cs.dartmouth.edu/pario/bib/ and by ftp at ftp://ftp.cs.dartmouth.edu/pub/pario/pario.bib. Both are easily reached from the Parallel-I/O Archive at http://www.cs.dartmouth.edu/pario/

This bibliography covers parallel I/O, with a significant emphasis on file systems rather than, say, network or graphics I/O. This includes architecture, operating systems, some algorithms, some databases, and some workload characterization. Because of the expanding nature of this field, I cannot cover everything, and this bibliography is admittedly spotty on topics like disk arrays, parallel databases, and parallel networking.

The entries are alphabetized by cite key. The emphasis is on including everything I have, rather than selecting a few key articles of interest. Thus, you probably don't want (or need) to read everything here. There are many repeated entries, in the sense that a paper is often published first as a TR, then in a conference, then in a journal. The "earlier" and "later" tags tie together versions of a paper.

Except where noted, all comments are mine, and any opinions expressed there are mine only. In some cases I am simply restating the opinion or result obtained by the paper's authors, and thus even I might disagree with the statement. I keep most editorial comments to a minimum.

Please send any additions or corrections (new abstracts and URLs would be great!) to me at this address (I have to hide the address in a string so bibtex won't complain about this header): @string{parallel-io-bib = "parallel-io-bib@listserv.dartmouth.edu"}

Indeed, if you want to get updates to the bibliography (released once per week), subscribe to the parallel-io-bib mailing list by sending an email to @string {"LISTSERV@LISTSERV.DARTMOUTH.EDU"} % have to hide this from bibtex
with "SUBSCRIBE PARALLEL-IO-BIB" in the body of the message. To adjust your subscription preferences, go to the site http://listserv.dartmouth.org/scripts/wa.exe?SUBED1=PARALLEL-IO-BIB

You may use the bibliography as you please except for publishing it as a whole, since the compilation is mine. Please leave this header on the collection; BibTeX won't mind.

Ron Oldfield
Sandia National Laboratories, Org 9221
P.O. Box 5800
Albuquerque, NM 87185-1110
URL: http://www.cs.dartmouth.edu/~raoldfi/
505-284-9153
@string {email = "raoldfi@sandia.gov"} % have to hide this from bibtex

% BibTeX bibliography file

@InProceedings{abali:ibm370, author = {B\"{u}lent Abali and Bruce D. Gavril and Richard W. Hadsell and Linh Lam and Brion Shimamoto}, title = {{Many/370: A} Parallel Computer Prototype for {I/O} Intensive Applications}, booktitle = {Proceedings of the Sixth Annual Distributed-Memory Computer Conference}, year = {1991}, pages = {728--730}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {Describes a parallel IBM/370, where they attach several small 370s to a switch, and several disks to each 370. Not much in the way of striping.} } @InProceedings{abawajy:scheduling, author = {J. H.
Abawajy}, title = {Performance Analysis of Parallel {I/O} Scheduling Approaches on Cluster Computing Systems}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {724--729}, organization = {Carleton University}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190724abs.htm}, keywords = {parallel I/O, I/O scheduling algorithms, pario-bib}, abstract = {As computation and communication hardware performance continue to rapidly increase, I/O represents a growing fraction of application execution time. This gap between the I/O subsystem and others is expected to increase in the future since I/O performance is limited by physical motion. Therefore, it is imperative that novel techniques for improving I/O performance be developed. Parallel I/O is a promising approach to alleviating this bottleneck. However, very little work exists with respect to scheduling parallel I/O operations explicitly. In this paper, we address the problem of effective management of parallel I/O in cluster computing systems by using appropriate I/O scheduling strategies. We propose two new I/O scheduling algorithms and compare them with two existing scheduling approaches. The preliminary results show that the proposed policies outperform existing policies substantially.} } @Book{abello:dimacs, title = {External Memory Algorithms and Visualization}, booktitle = {External Memory Algorithms and Visualization}, editor = {James Abello and Jeffrey Scott Vitter}, year = {1999}, series = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science}, publisher = {American Mathematical Society Press}, address = {Providence, RI}, keywords = {parallel I/O, out-of-core algorithm, pario-bib}, comment = {See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.} } @InProceedings{abello:graph, author = {James Abello and Adam L. Buchsbaum and Jeffrey R. Westbrook}, title = {A Functional Approach to External Memory Graph Algorithms}, booktitle = {Proceedings of the 6th Annual European Symposium on Algorithms}, year = {1998}, month = {August}, series = {Lecture Notes in Computer Science}, volume = {1461}, pages = {332--343}, publisher = {Springer-Verlag}, address = {Venice, Italy}, URL = {http://link.springer.de/link/service/series/0558/bibs/1461/14610332.htm}, keywords = {out-of-core algorithm, graph, pario-bib} } @Article{abu-safah:speedup, author = {Walid Abu-Safah and Harlan Husmann and David Kuck}, title = {On {Input/Output} Speed-up in Tightly-coupled Multiprocessors}, journal = {IEEE Transactions on Computers}, year = {1986}, pages = {520--530}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, I/O, pario-bib}, comment = {Derives formulas for the speedup with and without I/O considered and with parallel software and hardware format conversion. Considering I/O gives a more optimistic view of the speedup of a program {\em assuming} that the parallel version can use its I/O bandwidth as effectively as the serial processor.
Concludes that, for a given number of processors, increasing the I/O bandwidth is the most effective way to speed up the program (over the format conversion improvements).} } @InProceedings{acharya:tuning, author = {Anurag Acharya and Mustafa Uysal and Robert Bennett and Assaf Mendelson and Michael Beynon and Jeffrey K. Hollingsworth and Joel Saltz and Alan Sussman}, title = {Tuning the Performance of {I/O} Intensive Parallel Applications}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {15--27}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, filesystem workload, parallel application, pario-bib}, abstract = {Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four applications achieve application-level I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and knowledge of future I/O requests. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance.} } @Article{aggarwal:sorting, author = {Alok Aggarwal and Jeffrey Scott Vitter}, title = {The Input/Output Complexity of Sorting and Related Problems}, journal = {Communications of the ACM}, year = {1988}, month = {September}, volume = {31}, number = {9}, pages = {1116--1127}, keywords = {parallel I/O, sorting, pario-bib}, abstract = {We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs~(I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring $P$~blocks each containing $B$~records in a single time unit; the records in each block must be input from or output to $B$~contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for $P=1$ that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of~I/Os. Our sorting algorithms use the same number of~I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case $B = P = O(1)$.}, comment = {Good comments on typical external sorts, and how big they are. Focuses on parallelism at the disk. They give tight theoretical bounds on the number of I/O's required to do external sorting and other problems (FFTs, matrix transpose, etc.).
If $x$ is the number of blocks in the file and $y$ is the number of blocks that fit in memory, then the number of I/Os needed grows as $\Theta (x \log x / \log y)$. If parallel transfers of $p$ blocks are allowed, speedup linear in $p$ is obtained.} } @InProceedings{agrawal:asynch, author = {Gagan Agrawal and Anurag Acharya and Joel Saltz}, title = {An Interprocedural Framework for Placement of Asynchronous {I/O} Operations}, booktitle = {Proceedings of the 10th ACM International Conference on Supercomputing}, year = {1996}, month = {May}, pages = {358--365}, publisher = {ACM Press}, address = {Philadelphia, PA}, keywords = {compiler, I/O, pario-bib}, comment = {Not really about parallel applications or parallel I/O, but I think it may be of interest to that community. They propose a compiler framework for a compiler to insert asynchronous I/O operations (start I/O, finish I/O), to satisfy the dependency constraints of the program.} } @Article{aguilar:graph, author = {Jose Aguilar}, title = {A Graph Theoretical Model for Scheduling Simultaneous {I/O} Operations on Parallel and Distributed Environments}, journal = {Parallel Processing Letters}, year = {2002}, month = {March}, volume = {12}, number = {1}, pages = {113--126}, publisher = {World Scientific}, URL = {http://www.worldscinet.com/ppl/12/1201/S0129626402000860.html}, keywords = {parallel I/O, scheduling, pario-bib}, abstract = {The motivation for the research presented here is to develop an approach for scheduling I/O operations in distributed/parallel computer systems. First, a general model for specifying the parallel I/O scheduling problem is developed. The model defines the I/O bandwidth for different parallel/distributed architectures. Then the model is used to establish an algorithm for scheduling I/O operations on these architectures.} } @InProceedings{ali:enhancing, author = {Zeyad Ali and Qutaibah Malluhi}, title = {Enhancing data-intensive applications performance by tuning the distributed storage policies}, booktitle = {Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'04}, year = {2004}, month = {June}, volume = {3}, pages = {1515--1522}, copyright = {(c)2005 Elsevier Engineering Information, Inc.}, address = {Las Vegas, NV}, keywords = {application-specific storage policies, pario-app, DTViewer, access patterns, data layout, pario-bib}, abstract = {This paper describes the performance improvements achieved by a data-intensive application by controlling the storage policies and algorithms of a distributed storage system. The Network Storage Manager (NSM) is a mass distributed storage framework with a unique architecture that provides applications with the high-performance features they need. It also provides standard implementations of the most commonly used storage policies. Distributed Terrain Viewer (DTViewer) is an application that utilizes the NSM architecture for efficient and reliable data delivery. Moreover, it exploits NSM's controllable architecture by plugging in its application-specific optimized implementations. DTViewer overrides the default NSM policies that do not understand its sophisticated access patterns, partitioning, and storage layout requirements. Experimental results show significant improvements when the application-tailored implementations are used. Such speedups are not achievable on storage systems with no application control, such as the Parallel Virtual File System (PVFS).
(44 Refs.)} } @Article{allcock:grid, author = {Bill Allcock and Joe Bester and John Bresnahan and Ann L. Chervenak and Ian Foster and Carl Kesselman and Sam Meder and Veronika Nefedova and Darcy Quesnel and Steven Tuecke}, title = {Data management and transfer in high-performance computational grid environments}, journal = {Parallel Computing}, year = {2002}, month = {May}, volume = {28}, number = {5}, pages = {749--771}, publisher = {Elsevier Science}, URL = {http://www.elsevier.com/gej-ng/10/35/21/60/57/31/abstract.html}, keywords = {computational grid, data transfer, network, I/O, pario-bib}, abstract = {An emerging class of data-intensive applications involve the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data Grids provide essential infrastructure for such applications, much as the Internet provides essential services for applications such as e-mail and the Web. We describe here two services that we believe are fundamental to any Data Grid: reliable, high-speed transport and replica management. Our high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. Our replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. We present the design of both services and also preliminary performance results. Our implementations exploit security and other services provided by the Globus Toolkit.} } @InProceedings{allen:cactus, author = {Gabrielle Allen and Tom Goodale and Joan Mass\'o and Edward Seidel}, title = {The Cactus Computational Toolkit and Using Distributed Computing to Collide Neutron Stars}, booktitle = {Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing}, year = {1999}, month = {August}, pages = {57--61}, publisher = {IEEE Computer Society Press}, address = {Redondo Beach, CA}, URL = {http://computer.org/conferen/proceed/hpdc/0287/02870007abs.htm}, keywords = {scientific application, grid, input/output, parallel-io, pario-bib}, abstract = {We are developing a system for collaborative research and development for a distributed group of researchers at different institutions around the world. In a new paradigm for collaborative computational science, the computer code and supporting infrastructure itself becomes the collaborating instrument, just as an accelerator becomes the collaborating tool for large numbers of distributed researchers in particle physics. The design of this "Collaboratory" allows many users, with very different areas of expertise, to work coherently together, on distributed computers around the world. Different supercomputers may be used separately, or for problems exceeding the capacity of any single system, multiple supercomputers may be networked together through high speed gigabit networks. Central to this Collaboratory is a new type of community simulation code, called "Cactus". 
The scientific driving force behind this project is the simulation of Einstein's equations for studying black holes, gravitational waves, and neutron stars, which has brought together researchers in very different fields from many groups around the world to make advances in the study of relativity and astrophysics. But the system is also being developed to provide scientists and engineers, without expert knowledge of parallel or distributed computing, mesh refinement, and so on, with a simple framework for solving any system of partial differential equations on many parallel computer systems, from traditional supercomputers to networks of workstations.}, comment = {invited talk. They describe a computational toolkit (CACTUS) that allows developers to construct code modules (thorns) to plug into the core system (cactus flesh). The toolkit includes thorns for solving partial differential equations using MPI, parallel elliptic solvers, thorns for I/O using FlexIO or HDF5, and thorns for checkpointing. The talk showed results from a cactus code demo that ran at SC'98. The demo combined two tightly-connected supercomputers (one in Europe and one in America) using Globus to simulate the collision of two neutron stars.} } @InProceedings{alvarez:failures, author = {Guillermo A. Alvarez and Walter A. Burkhard and Flaviu Cristian}, title = {Tolerating Multiple Failures in {RAID} Architectures with Optimal Storage and Uniform Declustering}, booktitle = {Proceedings of the 24th Annual International Symposium on Computer Architecture}, year = {1997}, month = {May}, pages = {62--72}, publisher = {IEEE Computer Society Press}, later = {alvarez:bfailures}, URL = {http://portal.acm.org/citation.cfm?id=264107.264132}, keywords = {fault tolerance, RAID, disk array, parallel I/O, pario-bib} } @Article{alvarez:jminerva, author = {Guillermo A. Alvarez and Elizabeth Borowsky and Susie Go and Theodore H. Romer and Ralph Becker-Szendy and Richard Golding and Arif Merchant and Mirjana Spasojevic and Alistair Veitch and John Wilkes}, title = {Minerva: An automated resource provisioning tool for large-scale storage systems}, journal = {ACM Transactions on Computer Systems}, year = {2001}, month = {November}, volume = {19}, number = {4}, pages = {483--518}, URL = {http://doi.acm.org/10.1145/502912.502915}, keywords = {disk array, storage system, RAID, automatic design, parallel I/O, pario-bib}, abstract = {Enterprise-scale storage systems, which can contain hundreds of host computers and storage devices and up to tens of thousands of disks and logical volumes, are difficult to design. The volume of choices that need to be made is massive, and many choices have unforeseen interactions. Storage system design is tedious and complicated to do by hand, usually leading to solutions that are grossly over-provisioned, substantially under-performing or, in the worst case, both. To solve the configuration nightmare, we present Minerva: a suite of tools for designing storage systems automatically. Minerva uses declarative specifications of application requirements and device capabilities; constraint-based formulations of the various sub-problems; and optimization techniques to explore the search space of possible solutions. This paper also explores and evaluates the design decisions that went into Minerva, using specialized micro- and macro-benchmarks. We show that Minerva can successfully handle a workload with substantial complexity (a decision-support database benchmark).
Minerva created a 16-disk design in only a few minutes that achieved the same performance as a 30-disk system manually designed by human experts. Of equal importance, Minerva was able to predict the resulting system's performance before it was built.} } @InProceedings{alverson:tera, author = {Robert Alverson and David Callahan and Daniel Cummings and Brian Koblenz and Allan Porterfield and Burton Smith}, title = {The {Tera} Computer System}, booktitle = {Proceedings of the 1990 ACM International Conference on Supercomputing}, year = {1990}, pages = {1--6}, keywords = {parallel architecture, MIMD, NUMA, pario-bib}, comment = {Interesting architecture. 3-d mesh of pipelined packet-switch nodes, e.g., 16x16x16 is 4096 nodes, with 256 procs, 512 memory units, 256 I/O cache units, and 256 I/O processors attached. 2816 remaining nodes are just switching nodes. Each processor is 64-bit custom chip with up to 128 simultaneous threads in execution. It alternates between ready threads, with a deep pipeline. Inter-instruction dependencies explicitly encoded by the compiler, stalling those threads until the appropriate time. Each thread has a complete set of registers! Memory units have 4-bit tags on each word, for full/empty and trap bits. Shared memory across the network: ``The Tera ISP-level architecture is UMA, even though the PMS-level architecture is NUMA. Put another way, the memory looks a single cycle away to the compiler writer.'' -- Burton Smith. See also tera:brochure.} } @InCollection{anderson:bserverless, author = {Thomas E. Anderson and Michael D. Dahlin and Jeanna M. Neefe Matthews and David A. Patterson and Drew S. Roselli and Randolph Y. Wang}, title = {Serverless Network File Systems}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {24}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {364--385}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {anderson:serverless}, URL = {http://www.buyya.com/superstorage/}, keywords = {file caching, distributed file system, pario-bib}, comment = {Part of jin:io-book; reformatted version of anderson:serverless.} } @InProceedings{anderson:buttress, author = {Eric Anderson and Mahesh Kallahalla and Mustafa Uysal and Ram Swaminathan}, title = {Buttress: A toolkit for flexible and high fidelity {I/O} benchmarking}, booktitle = {Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies}, year = {2004}, month = {March}, pages = {45--58}, organization = {Hewlett-Packard Laboratories}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast04/tech/anderson.html}, keywords = {benchmarking software, performance analysis, I/O access patterns, I/O workloads, pario-bib}, abstract = {In benchmarking I/O systems, it is important to generate, accurately, the I/O access pattern that one is intending to generate. However, timing accuracy ( issuing I/Os at the desired time) at high I/O rates is difficult to achieve on stock operating systems. We currently lack tools to easily and accurately generate complex I/O workloads on modern storage systems. As a result, we may be introducing substantial errors in observed system metrics when we benchmark I/O systems using inaccurate tools for replaying traces or for producing synthetic workloads with known inter-arrival times. 
\par In this paper, we demonstrate the need for timing accuracy for I/O benchmarking in the context of replaying I/O traces. We also quantitatively characterize the impact of error in issuing I/Os on measured system parameters. For instance, we show that the error in perceived I/O response times can be as much as +350% or -15% by using naive benchmarking tools that have timing inaccuracies. To address this problem, we present Buttress, a portable and flexible toolkit that can generate I/O workloads with microsecond accuracy at the I/O throughputs of high-end enterprise storage arrays. In particular, Buttress can issue I/O requests within 100µs of the desired issue time even at rates of 10000 I/Os per second (IOPS).}, comment = {Looks like a really cool piece of software. Generates I/O workloads by replaying I/O traces.} } @InProceedings{anderson:raid, author = {Eric Anderson and Ram Swaminathan and Alistair Veitch and Guillermo {A. Alvarez} and John Wilkes}, title = {Selecting {RAID} Levels for Disk Arrays}, booktitle = {Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies}, year = {2002}, month = {January}, pages = {189--202}, publisher = {USENIX Association}, address = {Monterey, CA}, URL = {http://www.usenix.org/publications/library/proceedings/fast02/andersonRAID.html}, keywords = {file systems, pario-bib}, abstract = {Disk arrays have a myriad of configuration parameters that interact in counter-intuitive ways, and those interactions can have significant impacts on cost, performance, and reliability. Even after values for these parameters have been chosen, there are exponentially-many ways to map data onto the disk arrays' logical units. Meanwhile, the importance of correct choices is increasing: storage systems represent a growing fraction of total system cost, they need to respond more rapidly to changing needs, and there is less and less tolerance for mistakes. We believe that automatic design and configuration of storage systems is the only viable solution to these issues. To that end, we present a comparative study of a range of techniques for programmatically choosing the RAID levels to use in a disk array. Our simplest approaches are modeled on existing, manual rules of thumb: they "tag" data with a RAID level before determining the configuration of the array to which it is assigned. Our best approach simultaneously determines the RAID levels for the data, the array configuration, and the layout of data on that array. It operates as an optimization process with the twin goals of minimizing array cost while ensuring that storage workload performance requirements will be met. This approach produces robust solutions with an average cost/performance 14-17{PCT} better than the best results for the tagging schemes, and up to 150-200{PCT} better than their worst solutions. We believe that this is the first presentation and systematic analysis of a variety of novel, fully-automatic RAID-level selection techniques.} } @Article{anderson:serverless, author = {Thomas E. Anderson and Michael D. Dahlin and Jeanna M. Neefe and David A. Patterson and Drew S. Roselli and Randolph Y.
Wang}, title = {Serverless Network File Systems}, journal = {ACM Transactions on Computer Systems}, year = {1996}, month = {February}, volume = {14}, number = {1}, pages = {41--79}, publisher = {ACM Press}, later = {anderson:bserverless}, URL = {http://www.acm.org/pubs/citations/journals/tocs/1996-14-1/p41-anderson/}, keywords = {file caching, distributed file system, pario-bib}, comment = {See anderson:serverless-sosp.} } @InProceedings{ap:enwrich, author = {Apratim Purakayastha and Carla Schlatter Ellis and David Kotz}, title = {{ENWRICH:} A Compute-Processor Write Caching Scheme for Parallel File Systems}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {55--68}, publisher = {ACM Press}, copyright = {ACM}, address = {Philadelphia}, earlier = {ap:enwrich-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/ap:enwrich.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/ap:enwrich.pdf}, keywords = {parallel file system, parallel I/O, caching, pario-bib, dfk}, abstract = {Many parallel scientific applications need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance has not been able to keep up with substantial improvements in parallel-I/O hardware because of poor parallel file-system software. Many radical changes, both at the interface level and the implementation level, have recently been proposed. One such proposed interface is {\em collective I/O}, which allows parallel jobs to request transfer of large contiguous objects in a single request, thereby preserving useful semantic information that would otherwise be lost if the transfer were expressed as per-processor non-contiguous requests. Kotz has proposed {\em disk-directed I/O} as an efficient implementation technique for collective-I/O operations, where the compute processors make a single collective data-transfer request, and the I/O processors thereafter take full control of the actual data transfer, exploiting their detailed knowledge of the disk-layout to attain substantially improved performance. \par Recent parallel file-system usage studies show that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In this paper, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design via simulated implementation and show that ENWRICH achieves high performance for various configurations and workloads.} } @TechReport{ap:enwrich-tr, author = {Apratim Purakayastha and Carla Schlatter Ellis and David Kotz}, title = {{ENWRICH:} A Compute-Processor Write Caching Scheme for Parallel File Systems}, year = {1995}, month = {October}, number = {CS-1995-22}, institution = {Dept. 
of Computer Science, Duke University}, copyright = {the authors}, later = {ap:enwrich}, URL = {ftp://ftp.cs.duke.edu/dist/techreport/1995/1995-22.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/ap:enwrich-tr.pdf}, keywords = {parallel file system, parallel I/O, caching, pario-bib, dfk}, abstract = {Many parallel scientific applications need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance has not been able to keep up with substantial improvements in parallel-I/O hardware because of poor parallel file-system software. Many radical changes, both at the interface level and the implementation level, have recently been proposed. One such proposed interface is {\em collective I/O}, which allows parallel jobs to request transfer of large contiguous objects in a single request, thereby preserving useful semantic information that would otherwise be lost if the transfer were expressed as per-processor non-contiguous requests. Kotz has proposed {\em disk-directed I/O} as an efficient implementation technique for collective-I/O operations, where the compute processors make a single collective data-transfer request, and the I/O processors thereafter take full control of the actual data transfer, exploiting their detailed knowledge of the disk-layout to attain substantially improved performance. \par Recent parallel file-system usage studies show that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In this paper, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design via simulated implementation and show that ENWRICH achieves high performance for various configurations and workloads.} } @PhdThesis{ap:thesis, author = {Apratim Purakayastha}, title = {Characterizing and Optimizing Parallel File Systems}, year = {1996}, month = {June}, school = {Dept. of Computer Science, Duke University}, address = {Durham, NC}, note = {Also available as technical report CS-1996-10}, URL = {ftp://ftp.cs.duke.edu/dist/techreport/1996/1996-10.ps.gz}, keywords = {parallel I/O, multiprocessor file system, file access patterns, workload characterization, file caching, disk-directed I/O, pario-bib}, abstract = {High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. In the first part of this dissertation, we attempt to fill this void by measuring a real file-system workload on a production parallel machine, namely the CM-5 at the National Center for Supercomputing Applications. We collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for design of future parallel file systems. 
Our usage study showed that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In the second part of this dissertation, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. Within its framework, ENWRICH uses a recently proposed high performance implementation of collective I/O operations called disk-directed I/O, but it eliminates a number of limitations of disk-directed I/O. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface, and without the requirement for mapping libraries at the I/O processors. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design of ENWRICH using simulated implementation and extensive experimentation. We show that ENWRICH achieves high performance for various configurations and workloads. We pinpoint the reasons for ENWRICH`s failure to perform well for certain workloads, and suggest possible enhancements. Finally, we discuss the nuances of implementing ENWRICH on a real platform and speculate about possible adaptations of ENWRICH for emerging multiprocessing platforms.}, comment = {See also ap:enwrich, ap:workload, and nieuwejaar:workload} } @InProceedings{ap:workload, author = {Apratim Purakayastha and Carla Schlatter Ellis and David Kotz and Nils Nieuwejaar and Michael Best}, title = {Characterizing Parallel File-Access Patterns on a Large-Scale Multiprocessor}, booktitle = {Proceedings of the Ninth International Parallel Processing Symposium}, year = {1995}, month = {April}, pages = {165--172}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {ap:workload-tr}, later = {nieuwejaar:workload-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/ap:workload.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/ap:workload.pdf}, keywords = {parallel I/O, file access pattern, multiprocessor file system, file system workload, dfk, pario-bib}, abstract = {High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, here we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.}, comment = {See also kotz:workload, nieuwejaar:strided.} } @TechReport{ap:workload-tr, author = {Apratim Purakayastha and Carla Schlatter Ellis and David Kotz and Nils Nieuwejaar and Michael Best}, title = {Characterizing Parallel File-Access Patterns on a Large-Scale Multiprocessor}, year = {1994}, month = {October}, number = {CS-1994-33}, institution = {Dept. 
of Computer Science, Duke University}, copyright = {the authors}, later = {ap:workload}, URL = {ftp://ftp.cs.duke.edu/pub/dist/techreport/1994/1994-33.ps.Z}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/ap:workload-tr.pdf}, keywords = {parallel I/O, file access pattern, multiprocessor file system, file system workload, dfk, pario-bib}, abstract = {Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors. \par Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines. \par The results of our trace analysis lead to recommendations for parallel file system design. First, the file system should support efficient concurrent access to many files, and I/O requests from many jobs under varying load conditions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests, predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs, appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.}, comment = {See also kotz:workload, nieuwejaar:strided.} } @TechReport{arendt:genome, author = {James W. Arendt}, title = {Parallel Genome Sequence Comparison Using a Concurrent File System}, year = {1991}, number = {UIUCDCS-R-91-1674}, institution = {University of Illinois at Urbana-Champaign}, keywords = {parallel file system, parallel I/O, Intel iPSC/2, pario-bib}, comment = {Studies the performance of Intel CFS. Uses an application that reads in a huge file of records, each a genome sequence, and compares each sequence against a given sequence. Looks at cache performance, message latency, cost of prefetches and directory reads, and throughput. He tries one-disk, one-proc transfer rates. Because of contention with the directory server on one of the two I/O nodes, it was faster to put all of the file on the other I/O node. Striping is good for multiple readers. Best access pattern was interleaved, not segmented or separate files, because it avoided disk seeks. Perhaps the files are stored contiguously? Can get good speedup by reading the sequences in big integral record sizes, from CFS, using load balancing scheduled by the host.
Contention for directory blocks -- through single-node directory server.} } @InCollection{arge:GIS, author = {Lars Arge}, title = {External-memory algorithms with applications in {GIS}}, booktitle = {Algorithmic foundations of geographic information systems}, editor = {Marc van Kreveld and Jurg Nievergelt and Thomas Roos and Peter Widmayer}, year = {1997}, series = {Lecture Notes in Computer Science}, volume = {1340}, pages = {213--254}, publisher = {Springer-Verlag}, URL = {http://www.cs.duke.edu/~large/Papers/gisnotes.ps}, keywords = {out-of-core algorithm, geographic information system, GIS, pario-bib}, abstract = {The paper presents a survey of the basic paradigms for designing efficient external-memory algorithms and especially for designing external-memory algorithms for computational geometry problems with applications in GIS. As the area of external-memory algorithms is relatively young the paper focuses on fundamental external-memory design techniques more than on algorithms for specific GIS problems. The presentation is survey-like with a more detailed discussion of the most important techniques and algorithms.}, comment = {not parallel? but mentions some parallel disk stuff.} } @Article{arge:jsegments, author = {Lars Arge and Darren Erik Vengroff and Jeffrey Scott Vitter}, title = {External-Memory Algorithms for Processing Line Segments in Geographic Information Systems}, journal = {Algorithmica}, year = {1998}, note = {To appear}, earlier = {arge:segments}, URL = {ftp://cs.duke.edu/pub/jsv/Papers/AVV97.SegmentGIS.ps.gz}, keywords = {verify, out-of-core algorithm, computational geometry, pario-bib}, abstract = {We present a set of algorithms designed to solve large-scale geometric problems involving collections of line segments in the plane. Geographical information systems (GIS) handle large amounts of spatial data, and at some level the data is often manipulated as collections of line segments. NASA's EOS project is an example of a GIS that is expected to store and manipulate petabytes (thousands of terabytes, or millions of gigabytes) of data! In the design of algorithms for this type of large-scale application, it is essential to consider the problem of minimizing I/O communication, which is the bottleneck. \par In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them. To solve these problems, we combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. We also develop a powerful new technique that can be regarded as a practical external memory version of fractional cascading. Except for the batched planar point location problem, no algorithms specifically designed for external memory were previously known for these problems. Our algorithms for triangulation and line segment intersection partially answer previously posed open problems, while the batched planar point location algorithm improves on the previously known solution, which applied only to monotone decompositions. 
Our algorithm for the red-blue line segment intersection problem is provably optimal.}, comment = {Special issue on cartography and geographic information systems.} } @InCollection{arge:lower, author = {Lars Arge and Peter Bro Miltersen}, title = {On showing lower bounds for external-memory computational geometry problems}, booktitle = {External Memory Algorithms and Visualization}, editor = {James Abello and Jeffrey Scott Vitter}, crossref = {abello:dimacs}, year = {1999}, series = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science}, pages = {139--160}, publisher = {American Mathematical Society Press}, address = {Providence, RI}, keywords = {out-of-core algorithm, computational geometry, pario-bib}, comment = {See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.} } @InProceedings{arge:segments, author = {Lars Arge and Darren Erik Vengroff and Jeffrey Scott Vitter}, title = {External-Memory Algorithms for Processing Line Segments in Geographic Information Systems}, booktitle = {Proceedings of the Third European Symposium on Algorithms}, year = {1995}, month = {September}, series = {Lecture Notes in Computer Science}, volume = {979}, pages = {295--310}, publisher = {Springer-Verlag}, address = {Corfu, Greece}, later = {arge:jsegments}, URL = {ftp://cs.duke.edu/pub/jsv/Papers/AVV95.SegmentGIS.ps.Z}, keywords = {out-of-core algorithm, computational geometry, pario-bib}, abstract = {In the design of algorithms for large-scale applications it is essential to consider the problem of minimizing I/O communication. Geographical information systems (GIS) are good examples of such large-scale applications as they frequently handle huge amounts of spatial data. In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them.}, comment = {Does deal with parallel disks, though not in great detail.} } @InProceedings{arge:sorting, author = {Lars Arge and Paolo Ferragina and Roberto Grossi and Jeffrey Scott Vitter}, title = {Sequence sorting in secondary storage}, booktitle = {Proceedings of Compression and Complexity of Sequences}, year = {1998}, month = {June}, pages = {329--346}, publisher = {IEEE Computer Society Press}, address = {Salerno, Italy}, keywords = {out-of-core algorithm, sorting algorithm, pario-bib}, abstract = {We investigate the I/O complexity of the problem of sorting sequences (or strings of characters) in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is $\Theta(K \log_2 K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting sequences is $\Theta((K/B) \log_{M/B} (K/B) + N/B)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where the strings are not allowed to be broken into their individual characters, and we show that the I/O complexity of string sorting in this model is $\Theta((N_1/B) \log_{M/B} (N_1/B) + K_2 + N/B)$, where $N_1$ is the total length of all strings shorter than B and $K_2$ is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms outside the comparison model.}, comment = {This paper is really the same paper as arge:sorting-strings.} } @InProceedings{arge:sorting-strings, author = {Lars Arge and Paolo Ferragina and Roberto Grossi and Jeffrey Scott Vitter}, title = {On sorting strings in external memory}, booktitle = {Proceedings of the 29th Annual ACM Symposium on Theory of Computing}, year = {1997}, month = {May}, pages = {540--548}, publisher = {ACM Press}, address = {El Paso}, URL = {file://ftp.cs.duke.edu/pub/jsv/Papers/AFG97.stringsorting.ps.gz}, keywords = {out-of-core algorithm, sorting, parallel I/O, pario-bib}, abstract = {In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is $\Theta(K \log K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is $\Theta((K/B) \log_{M/B} (K/B) + N/B)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is $\Theta((N_1/B) \log_{M/B} (N_1/B) + K_2 \log_{M/B} K_2 + N/B)$, where $N_1$ is the total length of all strings shorter than B and $K_2$ is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.}, comment = {Not parallel? But mentions some parallel disk stuff.} } @InProceedings{armen:disk-model, author = {Chris Armen}, title = {Bounds on the Separation of Two Parallel Disk Models}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {122--127}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, theory, parallel I/O algorithm, pario-bib}, abstract = {The single-disk, D-head model of parallel I/O was introduced by Aggarwal and Vitter to analyze algorithms for problem instances that are too large to fit in primary memory.
Subsequently Vitter and Shriver proposed a more realistic model in which the disk space is partitioned into D disks, with a single head per disk. To date, each problem for which there is a known optimal algorithm for both models has the same asymptotic bounds on both models. Therefore, it has been unknown whether the models are equivalent or whether the single-disk model is strictly more powerful. \par In this paper we provide evidence that the single-disk model is strictly more powerful. We prove a lower bound on any general simulation of the single-disk model on the multi-disk model and establish randomized and deterministic upper bounds. Let $N$ be the problem size and let $T$ be the number of parallel I/Os required by a program on the single-disk model. Then any simulation of this program on the multi-disk model will require $\Omega\left(T \frac{\log(N/D)}{\log \log(N/D)}\right)$ parallel I/Os. This lower bound holds even if replication is allowed in the multi-disk model. We also show an $O\left(\frac{\log D}{\log \log D}\right)$ randomized upper bound and an $O\left(\log D (\log \log D)^2\right)$ deterministic upper bound. These results exploit an interesting analogy between the disk models and the PRAM and DCM models of parallel computation.} } @Article{arpaci-dusseau:jriver, author = {Arpaci-Dusseau, Remzi H.}, title = {Run-Time Adaptation in {R}iver}, journal = {ACM Transactions on Computer Systems}, year = {2003}, month = {February}, volume = {21}, number = {1}, pages = {36--86}, publisher = {ACM Press}, keywords = {distributed query processing, dataflow, pario-bib}, comment = {River is a dataflow programming environment for database query processing applications. River is specifically designed for clusters of computers with heterogeneous performance characteristics. The goal of the River runtime system is to adapt to "performance faults"--portions of the system that perform poorly--by dynamically adjusting the transfer of data through the dataflow graph. River uses two constructs to build applications: a distributed queue that deals with performance faults of consumers, and graduated declustering that deals with performance faults of producers. A distributed queue pushes data through the dataflow graph at a rate proportional to the rate of consumption and adapts to changes in consumption rates. Graduated declustering deals with producer performance faults by reading from replicated producers. Although River is designed specifically for query processing, they briefly discuss how one might adapt scientific applications to work in their framework.} } @InProceedings{arpaci-dusseau:river, author = {Remzi H. Arpaci-Dusseau and Eric Anderson and Noah Treuhaft and David E. Culler and Joseph M. Hellerstein and David Patterson and Kathy Yelick}, title = {Cluster {I/O} with {River}: Making the Fast Case Common}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {10--22}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Remzi.ps}, keywords = {cluster computing, parallel I/O, pario-bib}, abstract = {We introduce River, a data-flow programming environment and I/O substrate for clusters of computers. River is designed to provide maximum performance in the common case---even in the face of non-uniformities in hardware, software, and workload.
River is based on two simple design features: a high-performance distributed queue, and a storage redundancy mechanism called graduated declustering. We have implemented a number of data-intensive applications on River, which validate our design with near-ideal performance in a variety of non-uniform performance scenarios.} } @InProceedings{arunachalam:prefetch, author = {Meenakshi Arunachalam and Alok Choudhary and Brad Rullman}, title = {A Prefetching Prototype for the Parallel File System on the {Paragon}}, booktitle = {Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1995}, month = {May}, pages = {321--323}, note = {Extended Abstract}, keywords = {parallel I/O, prefetching, parallel file system, pario-bib}, comment = {A related paper is arunachalam:prefetch2.} } @InProceedings{arunachalam:prefetch2, author = {Meenakshi Arunachalam and Alok Choudhary and Brad Rullman}, title = {Implementation and evaluation of prefetching in the {Intel Paragon Parallel File System}}, booktitle = {Proceedings of the Tenth International Parallel Processing Symposium}, year = {1996}, month = {April}, pages = {554--559}, URL = {http://www.ece.nwu.edu/~meena/papers/ipps.ps}, keywords = {parallel I/O, prefetching, multiprocessor file system, pario-bib}, abstract = {The significant difference between the speeds of the I/O system (e.g., disks) and compute processors in parallel systems creates a bottleneck that lowers the performance of an application that does a considerable amount of disk accesses. A major portion of the compute processors' time is wasted on waiting for I/O to complete. This problem can be addressed to a certain extent, if the necessary data can be fetched from the disk before the I/O call to the disk is issued. Fetching data ahead of time, known as prefetching, in a multiprocessor environment depends a great deal on the application's access pattern. The subject of this paper is implementation and performance evaluation of a prefetching prototype in a production parallel file system on the Intel Paragon. Specifically, this paper presents a) design and implementation of a prefetching strategy in the parallel file system and b) performance measurements and evaluation of the file system with and without prefetching. The prototype is designed at the operating system level for the PFS. It is implemented in the PFS subsystem of the Intel Paragon Operating System. It is observed that in many cases prefetching provides considerable performance improvements. In some other cases no improvements or some performance degradation is observed due to the overheads incurred in prefetching.}, comment = {See arunachalam:prefetch.} } @InCollection{asami:bself, author = {Satoshi Asami and Nisha Talagala and David A. Patterson}, title = {Designing a Self-Maintaining Storage System}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {30}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {453--463}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {asami:self}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, disk array, RAID, pario-bib}, comment = {Part of jin:io-book; reformatted version of asami:self.} } @InProceedings{asami:self, author = {Satoshi Asami and Nisha Talagala and David A.
Patterson}, title = {Designing a Self-Maintaining Storage System}, booktitle = {Proceedings of the Sixteenth IEEE Symposium on Mass Storage Systems}, year = {1999}, month = {March}, pages = {222--233}, publisher = {IEEE Computer Society Press}, later = {asami:bself}, URL = {http://storageconference.org/1999/1999/posters/22asami.pdf}, keywords = {parallel I/O, disk array, RAID, pario-bib}, abstract = {This paper shows the suitability of a self-maintaining approach to Tertiary Disk, a large-scale disk array system built from commodity components. Instead of incurring the cost of custom hardware, we attempt to solve various problems by design and software. We have built a cluster of storage nodes connected by switched Ethernet. Each storage node is a PC hosting a few dozen SCSI disks, running the FreeBSD operating system. The system is used as a web-based image server for the Zoom Project in cooperation with the Fine Arts Museums of San Francisco (http://www.thinker.org/). We are designing self-maintenance extensions to the OS to run on this cluster to mitigate the system administrator's burden. There are several components required for building a self-maintaining system. One is decoupling the time of failure from the time of hardware replacement. This implies the system must have some amount of redundancy, and have no single point of failure. Our system is fully redundant, and everything is constructed to avoid a single point of failure. Another is correctly identifying failures and their dependencies. The paper also outlines several approaches to lower the human cost of system administration of such a system and to make the system as autonomous as possible.} } @InProceedings{asbury:fortranio, author = {Raymond K. Asbury and David S. Scott}, title = {{FORTRAN} {I/O} on the {iPSC/2}: Is there read after write?}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, pages = {129--132}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {parallel I/O, hypercube, Intel iPSC/2, file access pattern, pario-bib} } @InProceedings{asthana:active, author = {Abhaya Asthana and Mark Cravatts and Paul Krzyzanowski}, title = {An Experimental Active Memory Based {I/O} Subsystem}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, month = {April}, pages = {73--84}, organization = {AT\&T Bell Labs}, note = {Also appeared in Computer Architecture News 22(4)}, later = {asthana:active-book}, keywords = {parallel I/O, architecture, pario-bib}, comment = {They describe an I/O subsystem based on an ``active memory'' called SWIM (Structured Wafer-based Intelligent Memory). SWIM chips are RAM chips with some built-in processing. The idea is that these tiny processors can manipulate the data in the chip at full speed, without dealing with memory bus or off-chip costs. Further, the chips can work in parallel. They demonstrate how they've used this to build a national phone database server, a high-performance IP router, and a call-screening agent.} } @InCollection{asthana:active-book, author = {Abhaya Asthana and Mark Cravatts and Paul Krzyzanowski}, title = {An Experimental Memory-based {I/O} Subsystem}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {17}, editor = {Ravi Jain and John Werth and James C.
Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {373--390}, publisher = {Kluwer Academic Publishers}, earlier = {asthana:active}, keywords = {parallel I/O architecture, pario-bib}, abstract = {We describe an I/O subsystem based on an active memory named SWIM (Structured Wafer-based Intelligent Memory) designed for efficient storage and manipulation of data structures. The key architectural idea in SWIM is to associate some processing logic with each memory chip that allows it to perform data manipulation operations locally and to communicate with a disk or a communication line through a backend port. The processing logic is specially designed to perform operations such as pointer dereferencing, memory indirection, searching and bounds checking efficiently. The I/O subsystem is built using an interconnected ensemble of such memory logic pairs. A complex processing task can now be distributed between a large number of small memory processors each doing a sub-task, while still retaining a common locus of control in the host CPU for higher level administrative and provisioning functions. We argue that active memory based processing enables more powerful, scalable and robust designs for storage and communications subsystems, that can support emerging network services, multimedia workstations and wireless PCS systems. A complete parallel hardware and software system constructed using an array of SWIM elements has been operational for over a year. We present results from application of SWIM to three network functions: a national phone database server, a high performance IP router, and a call screening agent.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{avalani:channels, author = {Bhavan Avalani and Alok Choudhary and Ian Foster and Rakesh Krishnaiyer}, title = {Integrating Task and Data Parallelism Using Parallel {I/O} Techniques}, booktitle = {Proceedings of the International Workshop on Parallel Processing}, year = {1994}, month = {December}, address = {Bangalore, India}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/task_data.ps.Z}, keywords = {parallel I/O, pario-bib}, comment = {They describe using the techniques of delrosario and debenedictis (although without mentioning them) to provide for channels (parallel pipes) between independent data-parallel tasks. The technique really is the same as in debenedictis and delrosario, although they extend it a bit to allow multiple "files" within a channel (why not use multiple channels?). Also, they depend on the program to read and write synchronization variables to control access to the flow of data through the channel. While this may provide good performance in some cases, why not have support for automatic flow control? The system can detect when a portion of the channel is written, and release readers waiting on that portion of the channel (if any). The paper is a bit confusing in its use of the word "file", which seems to be used to mean different things at different points.
Also, they seem to use an arbitrary distribution for the "file", which may or may not be the same as one of those used by the two endpoints.} } @InProceedings{baer:grid-io, author = {Troy Baer and Pete Wyckoff}, title = {A parallel {I/O} mechanism for distributed systems.}, booktitle = {Proceedings of the 2004 IEEE International Conference on Cluster Computing}, year = {2004}, month = {September}, pages = {63--69}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, keywords = {grid I/O, MPI-I/O, grid middleware, gridFTP, pario-bib}, abstract = {Access to shared data is critical to the long term success of grids of distributed systems. As more parallel applications are being used on these grids, the need for some kind of parallel I/O facility across distributed systems increases. However, grid middleware has thus far had only limited support for distributed parallel I/O. In this paper, we present an implementation of the MPI-2 I/O interface using the Globus GridFTP client API. MPI is widely used for parallel computing, and its I/O interface maps onto a large variety of storage systems. The limitations of using GridFTP as an MPI-I/O transport mechanism are described, as well as support for parallel access to scientific data formats such as HDF and NetCDF. We compare the performance of GridFTP to that of NFS on the same network using several parallel I/O benchmarks. Our tests indicate that GridFTP can be a workable transport for parallel I/O, particularly for distributed read-only access to shared data sets. (26 refs.)} } @TechReport{bagrodia:sio-character, author = {Rajive Bagrodia and Andrew Chien and Yarsun Hsu and Daniel Reed}, title = {Input/Output: Instrumentation, Characterization, Modeling and Management Policy}, year = {1994}, number = {CCSF-41}, institution = {Scalable I/O Initiative}, address = {Caltech Concurrent Supercomputing Facilities, Caltech}, URL = {http://www.cacr.caltech.edu/SIO/pubs/SIO_perf.ps}, keywords = {parallel I/O, pario-bib, prefetching, caching, multiprocessor file system, file access pattern}, comment = {Basically there are two parts to this paper. First, they will instrument applications, Intel PFS, and IBM Vesta, to trace I/O-related activity. Then they'll use Pablo to analyze and characterize. They plan to trace some events in detail, and the rest with histogram counters. Second, they plan to develop caching and prefetching policies and to analyze those with simulation, analysis, and implementation. They note that IBM and Intel are developing parallel I/O architecture simulators. See also poole:sio-survey, choudhary:sio-language, bershad:sio-os.} } @InProceedings{bairavasundaram:x-ray, author = {Lakshmi N. Bairavasundaram and Muthian Sivathanu and Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau}, title = {{X-RAY}: A non-invasive exclusive caching mechanism for {RAID}s}, booktitle = {Proceedings of the 31st Annual International Symposium on Computer Architecture}, year = {2004}, month = {June}, pages = {176--187}, institution = {Univ Wisconsin, Dept Comp Sci, Madison, WI 53706 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, address = {Munich, Germany}, URL = {http://www.cs.wisc.edu/wind/Publications/xray-isca04.html}, keywords = {RAID, x-ray, caching policies, pario-bib}, abstract = {RAID storage arrays often possess gigabytes of RAM for caching disk blocks. Currently, most RAID systems use LRU or LRU-like policies to manage these caches. 
Since these array caches do not recognize the presence of file system buffer caches, they redundantly retain many of the same blocks as those cached by the file system, thereby wasting precious cache space. In this paper, we introduce X-RAY, an exclusive RAID array caching mechanism. X-RAY achieves a high degree of (but not perfect) exclusivity through gray-box methods: by observing which files have been accessed through updates to file system meta-data, X-RAY constructs an approximate image of the contents of the file system cache and uses that information to determine the exclusive set of blocks that should be cached by the array. We use microbenchmarks to demonstrate that X-RAY's prediction of the file system buffer cache contents is highly accurate, and trace-based simulation to show that X-RAY considerably outperforms LRU and performs as well as other more invasive approaches. The main strength of the X-RAY approach is that it is easy to deploy -- all performance gains are achieved without changes to the SCSI protocol or the file system above.} } @InProceedings{baird:disa, author = {R. Baird and S. Karamooz and H. Vazire}, title = {Distributed Information Storage Architecture}, booktitle = {Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems}, year = {1993}, pages = {145--155}, keywords = {parallel I/O, distributed file system, mass storage, pario-bib}, comment = {Architecture for distributed information storage. Integrates file systems, databases, etc. Single system image, lots of support for administration. O-O model, with storage device objects, logical device objects, volume objects, and file objects. Methods for each type of object, including administrative methods.} } @Article{bakker:semantic, author = {J.A. Bakker}, title = {Semantic partitioning as a basis for parallel {I/O} in database management systems}, journal = {Parallel Computing}, year = {2000}, month = {October}, volume = {26}, number = {11}, pages = {1491--1513}, URL = {http://dx.doi.org/10.1016/S0167-8191(00)00041-7}, keywords = {database, parallel I/O, pario-bib}, abstract = {Modern applications such as `video on demand' require fast reading of complete files, which can be supported well by file striping. Many conventional applications, however, are only interested in some part of the available records. In order to avoid reading attributes irrelevant to such applications, each attribute could be stored in a separate (transposed) file; Aiming at I/O parallelism, byte-oriented striping could be applied to transposed files. However, such a fragmentation ignores the semantics of data. This fragmentation cannot be optimized by a database management system (DBMS) because a DBMS has to perform its tasks on the basis of data semantics. For example, queries must be translated into file operations using a scheme that maps a data model to a file system. However, details about files, such as the striping width, are invisible to a DBMS. Therefore, we propose to store each transposed file related to a composite type on a separate, independent disk drive, which means I/O parallelism tuned to a data model. As we also aim at system reliability and data availability, each transposed file must be duplicated on another drive. Consequently, a DBMS also has to guarantee correctness and completeness of the allocation of transposed files within an array of disk drives. 
As a solution independent of the underlying data model, we propose an abstract framework consisting of a meta model and a set of rules} } @InProceedings{baldwin:hyperfs, author = {C. H. Baldwin and W. C. Nestlerode}, title = {A Large Scale File Processing Application on a Hypercube}, booktitle = {Proceedings of the Fifth Annual Distributed-Memory Computer Conference}, year = {1990}, pages = {1400-1404}, keywords = {multiprocessor file system, file access pattern, parallel I/O, hypercube, pario-bib}, comment = {Census-data processing on an nCUBE/10 at USC. Their program uses an interleaved pattern, which is like my lfp or gw with multi-record records (i.e., the application does its own blocking). Shifted to asynchronous I/O to do OBL manually. Better results if they did more computation per I/O (of course).} } @TechReport{baptist:fft, author = {Lauren M. Baptist}, title = {Two Algorithms for Performing Multidimensional, Multiprocessor, Out-of-Core {FFTs}}, year = {1999}, month = {June}, number = {PCS-TR99-350}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {Lauren M. Baptist}, address = {Hanover, NH}, URL = {ftp://ftp.cs.dartmouth.edu/TR/TR99-350.ps.Z}, keywords = {parallel I/O, out of core, FFT, parallel algorithm, scientific computing, pario-bib}, abstract = {We show two algorithms for computing multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis. \par The first method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of~2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the data for the 1-dimensional FFTs are contiguous.\par The second method is an adaptation of another well-known method for in-core, multidimensional FFTs. This method computes all dimensions simultaneously. It is more difficult to generalize to arbitrary radices and number of dimensions in this method than in the first method. Our present implementation is therefore limited to two dimensions of equal size, that are again integer powers of~2. \par We present I/O complexity analyses for both methods as well as empirical results for a DEC~2100 server and an SGI Origin~2000, each of which has a parallel disk system. Our results indicate that the methods are comparable in speed in two-dimensions.}, comment = {Undergraduate Honors Thesis. Advisor: Tom Cormen.} } @TechReport{barak:hfs, author = {Amnon Barak and Bernard A. Galler and Yaron Farber}, title = {A Holographic File System for a Multicomputer with Many Disk Nodes}, year = {1988}, month = {May}, number = {88-6}, institution = {Dept. of Computer Science, Hebrew University of Jerusalem}, keywords = {parallel I/O, hashing, reliability, disk mirroring, pario-bib}, comment = {Describes a file system for a distributed system that scatters records of each file over many disks using hash functions. 
The hash function is known by all processors, so no one processor must be up to access the file. Any portion of the file whose disknode is available may be accessed. Shadow nodes are used to take over for nodes that go down, saving the info for later use by the proper node. Intended to easily parallelize read/write accesses and global file operations, and to increase file availability.} } @InProceedings{barve:bus, author = {Rakesh Barve and Elizabeth Shriver and Phillip B. Gibbons and Bruce K. Hillyer and Yossi Matias and Jeffrey Scott Vitter}, title = {Modeling and optimizing {I/O} throughput of multiple disks on a bus (summary)}, booktitle = {Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems}, year = {1998}, month = {June}, pages = {264--265}, publisher = {ACM Press}, later = {barve:bus2}, URL = {http://www.acm.org/pubs/citations/proceedings/metrics/277851/p264-barve/}, keywords = {disk model, I/O bus, device model, I/O model, pario-bib} } @InProceedings{barve:bus2, author = {Rakesh Barve and Jeffrey Vitter and Elizabeth Shriver and Phillip Gibbons and Bruce Hillyer and Yossi Matias}, title = {Modeling and Optimizing {I/O} Throughput of Multiple Disks on a Bus}, booktitle = {Proceedings of the 1999 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1999}, month = {June}, pages = {83--92}, publisher = {ACM Press}, earlier = {barve:bus}, URL = {http://www.acm.org/pubs/citations/proceedings/metrics/301453/p83-barve/}, keywords = {disk model, I/O bus, device model, I/O model, pario-bib} } @InProceedings{barve:competitive2, author = {Rakesh Barve and Mahesh Kallahalla and Peter J. Varman and Jeffrey Scott Vitter}, title = {Competitive Parallel Disk Prefetching and Buffer Management}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {47--56}, publisher = {ACM Press}, address = {San Jose, CA}, URL = {http://doi.acm.org/10.1145/266220.266225}, keywords = {disk prefetching, file caching, parallel I/O, pario-bib}, abstract = {We provide a competitive analysis framework for online prefetching and buffer management algorithms in parallel I/O systems, using a read-once model of block references. This has widespread applicability to key I/O-bound applications such as external merging and concurrent playback of multiple video streams. Two realistic lookahead models, global lookahead and local lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently in existing systems. An important aspect of our work is that we show how to implement both the models of lookahead in practice using the simple techniques of forecasting and flushing. \par Given a D-disk parallel I/O system and a globally shared I/O buffer that can hold up to M disk blocks, we derive a lower bound of $\Omega(\sqrt{D})$ on the competitive ratio of any deterministic online prefetching algorithm with O(M) lookahead. NOM is shown to match the lower bound using global M-block lookahead. In contrast, using only local lookahead results in an $\Omega(D)$ competitive ratio. When the buffer is distributed into D portions of M/D blocks each, the algorithm GREED based on local lookahead is shown to be optimal, and NOM is within a constant factor of optimal.
Thus we provide a theoretical basis for the intuition that global lookahead is more valuable for prefetching in the case of a shared buffer configuration whereas it is enough to provide local lookahead in case of the distributed configuration. Finally, we analyze the performance of these algorithms for reference strings generated by a uniformly-random stochastic process and we show that they achieve the minimal expected number of I/Os. These results also give bounds on the worst-case expected performance of algorithms which employ randomization in the data layout.}, comment = {See also barve:competitive. They propose two methods for scheduling prefetch operations in the situation where the access pattern is largely known in advance, in such a way as to minimize the total number of parallel I/Os. The two methods are quite straightforward, and yet match the optimum lower bound for an on-line algorithm.} } @Article{barve:jmergesort, author = {Rakesh D. Barve and Edward F. Grove and Jeffrey S. Vitter}, title = {Simple Randomized Mergesort on Parallel Disks}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {601--631}, publisher = {North-Holland (Elsevier Scientific)}, earlier = {barve:mergesort}, URL = {file://cs.duke.edu/pub/jsv/Papers/BGV96.Simple_Mergesort.ps.gz}, keywords = {parallel I/O algorithm, sorting, pario-bib}, abstract = {We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the ``right'' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the ``right'' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs. \par The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M, and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.}, comment = {This paper formerly called barve:mergesort; I discovered that the paper had appeared in SPAA96, so the SPAA96 paper is now called barve:mergesort.} } @InProceedings{barve:mergesort, author = {Rakesh D. Barve and Edward F. Grove and Jeffrey S. 
Vitter}, title = {Simple Randomized Mergesort on Parallel Disks}, booktitle = {Proceedings of the Eighth Symposium on Parallel Algorithms and Architectures}, year = {1996}, month = {June}, pages = {109--118}, publisher = {ACM Press}, address = {Padua, Italy}, later = {barve:jmergesort}, keywords = {parallel I/O algorithm, sorting, pario-bib} } @InProceedings{barve:round, author = {Rakesh Barve and Phillip B. Gibbons and Bruce K. Hillyer and Yossi Matias and Elizabeth Shriver and Jeffrey Scott Vitter}, title = {Round-like Behavior in Multiple Disks on a Bus}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {1--9}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Shriver.ps}, keywords = {disk, I/O bus, parallel I/O, pario-bib}, abstract = {In modern I/O architectures, multiple disk drives are attached to each I/O bus. Under I/O-intensive workloads, the disk latency for a request can be overlapped with the disk latency and data transfers of requests to other disks, potentially resulting in an aggregate I/O throughput at nearly bus bandwidth. This paper reports on a performance impairment that results from a previously unknown form of convoy behavior in disk I/O, which we call rounds. In rounds, independent requests to distinct disks convoy, so that each disk services one request before any disk services its next request. We analyze log files to describe read performance of multiple Seagate Wren-7 disks that share a SCSI bus under a heavy workload, demonstrating the rounds behavior and quantifying its performance impact.} } @Article{batcher:staran, author = {K. E. Batcher}, title = {{STARAN} Parallel Processor System Hardware}, journal = {AFIPS Conference Proceedings}, year = {1974}, pages = {405--410}, keywords = {parallel architecture, array processor, parallel I/O, SIMD, pario-bib}, comment = {This paper is reproduced in Kuhn and Padua's (1981, IEEE) survey ``Tutorial on Parallel Processing.'' The STARAN is an array processor that uses Multi-Dimensional-Access (MDA) memories and permutation networks to access data in bit slices in a variety of ways, with high-speed I/O capabilities. Its router (called the {\em flip} network) could permute data among the array processors, or between the array processors and external devices, including disks, video input, and displays.} } @InProceedings{baylor:methodology, author = {Sandra Johnson Baylor and Caroline Benveniste and Leo J. Beolhouwer}, title = {A Methodology for Evaluating Parallel {I/O} Performance for Massively Parallel Processors}, booktitle = {Proceedings of the 27th Annual Simulation Symposium}, year = {1994}, month = {April}, pages = {31--40}, keywords = {parallel I/O, parallel architecture, simulation, pario-bib} } @InProceedings{baylor:perfeval, author = {Sandra Johnson Baylor and Caroline B. Benveniste and Yarsun Hsu}, title = {Performance Evaluation of a Parallel {I/O} Architecture}, booktitle = {Proceedings of the 9th ACM International Conference on Supercomputing}, year = {1995}, month = {July}, pages = {404--413}, publisher = {ACM Press}, address = {Barcelona}, earlier = {baylor:perfeval-tr}, keywords = {performance evaluation, parallel architecture, parallel I/O, pario-bib}, comment = {They use a simulator to evaluate the performance of a parallel I/O system. 
They simulate the network and disks under a synthetic workload, and measure the time it takes for I/O requests to traverse the network, be processed, and return. They also measure the impact of I/O requests on non-I/O messages. Their results are fairly unsurprising.} } @TechReport{baylor:perfeval-tr, author = {Sandra Johnson Baylor and Caroline B. Benveniste and Yarsun Hsu}, title = {Performance Evaluation of a Parallel {I/O} Architecture}, year = {1995}, month = {May}, number = {RC~20049}, institution = {IBM T.~J. Watson Research Center}, later = {baylor:perfeval}, keywords = {performance evaluation, parallel architecture, parallel I/O, pario-bib} } @InProceedings{baylor:vulcan-perf, author = {Sandra Johnson Baylor and Caroline Benveniste and Yarsun Hsu}, title = {Performance Evaluation of a Massively Parallel {I/O} Subsystem}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, pages = {1--15}, organization = {IBM Watson Research Center}, note = {Also appeared in Computer Architecture News 22(4)}, later = {baylor:vulcan-perf-book}, keywords = {parallel I/O, parallel architecture, performance analysis, pario-bib}, comment = {See polished version baylor:vulcan-perf-book. Simulation of the I/O architecture for the Vulcan MPP at IBM TJW. This is a distributed-memory MIMD system with a bidirectional omega-type interconnection network, and separate compute and I/O nodes. They use a stochastic workload to evaluate the average I/O performance under a few different situations, and then use that average performance, along with a stochastic workload, in a detailed simulation of the interconnection network. (What would be the effect of adding variance to the I/O-node performance?) A key point is that the I/O node will not accept any more requests until a current write request is finished being processed (copied into the write-back cache). If there are many writes, this can backup the network (would a different write-request protocol help?) Not clear how concurrency of reads are modeled. Results show that network saturates for high request rates and small number of I/O nodes. As request rate decreases or number of I/O nodes increases, performance levels off to a reasonable value. Placement of I/O nodes didn't make much difference, nor did extra non-I/O traffic. Given their parameters, and for reasonable loads, 1 I/O node per 4 compute nodes was a reasonable balance, and was scalable.} } @InCollection{baylor:vulcan-perf-book, author = {Sandra Johnson Baylor and Caroline Benveniste and Yarsun Hsu}, title = {Performance Evaluation of a Massively Parallel {I/O} Subsystem}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {13}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {293--311}, publisher = {Kluwer Academic Publishers}, earlier = {baylor:vulcan-perf}, keywords = {parallel I/O architecture, performance evaluation, pario-bib}, abstract = {Presented are the trace-driven simulation results of a study conducted to evaluate the performance of the internal parallel I/O subsystem of the Vulcan massively parallel processor (MPP) architecture. The system sizes evaluated vary from 16 to 512 nodes. The results show that a compute node to I/O node ratio of four is the most cost effective for all system sizes, suggesting high scalability. 
Also, processor-to-processor communication effects are negligible for small message sizes and the greater the fraction of I/O reads, the better the I/O performance. Worst-case I/O node placement is within 13\% of more efficient placement strategies. Introducing parallelism into the internal I/O subsystem improves I/O performance significantly.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{baylor:workload, author = {Sandra Johnson Baylor and C. Eric Wu}, title = {Parallel {I/O} Workload Characteristics Using {Vesta}}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {16--29}, later = {baylor:workload-book}, keywords = {parallel I/O, workload characterization, pario-bib}, abstract = {In recent years, the design and performance evaluation of parallel processors has focused on the processor, memory and communication subsystems. As a result, these subsystems have better performance potential than the I/O subsystem. In fact, the I/O subsystem is the bottleneck in many machines. However, there are a number of studies currently underway to improve the design of parallel I/O subsystems. To develop optimal parallel I/O subsystem designs, one must have a thorough understanding of the workload characteristics of parallel I/O and its exploitation of the associated parallel file system. Presented are the results of a study conducted to analyze the parallel I/O workloads of several applications on a parallel processor using the Vesta parallel file system. Traces of the applications are obtained to collect system events, communication events, and parallel I/O events. The traces are then analyzed to determine workload characteristics. The results show I/O request rates on the order of hundreds of requests per second, a large majority of requests are for small amounts of data (less than 1500 bytes), a few requests are for large amounts of data (on the order of megabytes), significant file sharing among processes within a job, and strong temporal, traditional spatial, and interprocess spatial locality.}, comment = {See polished version baylor:workload-book. They characterize four parallel applications: sort, matrix multiply, seismic migration, and video server, in terms of their I/O activity. They found results that are consistent with kotz:workload, in that they also found lots of small data requests, some large data requests, significant file sharing and interprocess locality. This study found less of the non-contiguous access than did kotz:workload, because of the logical views provided by Vesta. Note on-line postscript does not include figures.} } @InCollection{baylor:workload-book, author = {Sandra Johnson Baylor and C. Eric Wu}, title = {Parallel {I/O} Workload Characteristics Using {Vesta}}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {7}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {167--185}, publisher = {Kluwer Academic Publishers}, earlier = {baylor:workload}, keywords = {parallel I/O, file access pattern, workload characterization, file system workload, pario-bib}, abstract = {To develop optimal parallel I/O subsystems, one must have a thorough understanding of the workload characteristics of parallel I/O and its exploitation of the associated parallel file system.
Presented are the results of a study conducted to analyze the parallel I/O workloads of several applications on a parallel processor using the Vesta parallel file system. Traces of the applications are obtained to collect system events, communication events, and parallel I/O events. The traces are then analyzed to determine workload characteristics. The results show I/O request rates on the order of hundreds of requests per second, a large majority of requests are for small amounts of data (less than 1500 bytes), a few requests are for large amounts of data (on the order of megabytes), significant file sharing among processes within a job, and strong temporal, traditional spatial, and interprocess spatial locality.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @Manual{bbn:admin, key = {BBN}, author = {BBN Advanced {Computers Inc.}}, title = {{TC2000} System Administration Guide}, edition = {Revision 3.0}, year = {1991}, month = {April}, keywords = {BBN, parallel I/O, pario-bib}, comment = {Administrative manual for the TC2000 I/O system. Can stripe over partitions in a user-specified set of disks. Large requests automatically split and done in parallel. See also garber:tc2000.} } @TechReport{becher:ooc-solver, author = {Jonathan D. Becher and John F. Porter}, title = {Out of Core Dense Solvers for the {MasPar} Parallel Computer}, year = {1994}, number = {MP/IP/SP-37.94}, institution = {MasPar Computer Corporation}, keywords = {parallel I/O, scientific computing, linear algebra, pario-bib}, comment = {They look at out-of-core block and slab solvers for the Maspar. They overlap reading one block with the computation of the previous block. They solve matrices up to 40k x 40k, and obtain 3.14 GFlops even with I/O considered.} } @InProceedings{bell:physics, author = {Jean L. Bell}, title = {A Specialized Data Management System for Parallel Execution of Particle Physics Codes}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1988}, pages = {277--285}, publisher = {ACM Press}, address = {Chicago, IL}, keywords = {file access pattern, disk prefetch, file system, pario-bib}, comment = {A specialized database system for particle physics codes. Valuable for its description of access patterns and subsequent file access requirements. Particle-in-cell codes iterate over timesteps, updating the position of each particle, and then the characteristics of each cell in the grid. Particles may move from cell to cell. Particle update needs itself and nearby gridcell data. The whole dataset is too big for memory, and each timestep must be stored on disk for later analysis anyway. Regular file systems are inadequate: specialized DBMS is more appropriate. Characteristics needed by their application class: multidimensional access (by particle type or by location, i.e., multiple views of the data), coordination between grid and particle data, coordination between processors, coordinated access to meta-data, inverted files, horizontal clustering, large blocking of data, asynchronous I/O, array data, complicated joins, and prefetching according to user-prespecified order. Note that many of these things can be provided by a file system, but that most are hard to come by in typical file systems, if not impossible. Many of these features are generalizable to other applications.} } @InProceedings{benner:pargraphics, author = {Robert E. 
Benner}, title = {Parallel Graphics Algorithms on a 1024-Processor Hypercube}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, pages = {133--140}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {hypercube, graphics, parallel algorithm, parallel I/O, pario-bib}, comment = {About using the nCUBE/10's RT Graphics System. They were frustrated by an unusual mapping from the graphics memory to the display, a shortage of memory on the graphics nodes, and small message buffers on the graphics nodes. They wrote some algorithms for collecting the columns of pixels from the hypercube nodes, and routing them to the appropriate graphics node. They also would have liked a better interconnection network between the graphics nodes, at least for synchronization.} } @InProceedings{bennett:jovian, author = {Robert Bennett and Kelvin Bryant and Alan Sussman and Raja Das and Joel Saltz}, title = {{Jovian}: A Framework for Optimizing Parallel {I/O}}, booktitle = {Proceedings of the Scalable Parallel Libraries Conference}, year = {1994}, month = {October}, pages = {10--20}, publisher = {IEEE Computer Society Press}, address = {Mississippi State, MS}, URL = {ftp://hpsl.cs.umd.edu/pub/papers/splc94.ps.Z}, keywords = {parallel I/O, pario-bib}, comment = {Jovian is a runtime library for use with SPMD codes, eg, HPF. They restrict IO to collective operations, and provide extra processes to 'coalesce' the many requests from multiple CPs into fewer larger requests to the operating system, perhaps optimized for access order. They mention that there is a standardization process underway for specifying data distributions. Also a compact representation for strided access to n-dimensional data structures. Coalescing basically means combining requests to eliminate duplication and to combine adjacent requests. Requests to coalescers are in full blocks, to lower the processing overhead. Nonetheless, their method involves moving requests around twice, and involve several memory-memory copies of the data, so their overhead is high.} } @Misc{berdahl:transport, author = {Lawrence Berdahl}, title = {Parallel Transport Protocol Proposal}, year = {1995}, month = {January 3,}, howpublished = {Lawrence Livermore National Labs}, note = {Draft}, earlier = {berdahl:woodenman}, URL = {ftp://ftp.cs.dartmouth.edu/pub/pario/berdahl:transport.ps.Z}, keywords = {parallel I/O, network, supercomputer system, pario-bib}, comment = {An update of berdahl:woodenman, close to the final draft.} } @Misc{berdahl:woodenman, author = {Lawrence Berdahl}, title = {Parallel Data Exchange}, year = {1994}, month = {January 28,}, howpublished = {Lawrence Livermore National Labs}, note = {WoodenMan Proposal}, later = {berdahl:transport}, keywords = {parallel I/O, network, supercomputer system, pario-bib}, comment = {They describe a protocol for making parallel data transfers of arbitrary data sets from one set of data servers to another set of data servers. The goal is to be independent of specific architectures or even types of data servers, and to work on top of existing transport protocols. The data set is described using a gather set for the source and a scatter set for the destination, and using a linear address space as an intermediate representation. All the servers are contacted, they figure out who they need to talk, and exchange port information with them. 
Each pair exchanges votes on who will control the transfer (ie, who will control the order of the transfer), and on their maximum data rates. This information is used to settle on the control and set of ports to be used. This proposal is not final and is under active development, so it may change.} } @Article{berrendorf:paragon, author = {R. Berrendorf and H. Burg and U. Detert}, title = {Performance Characteristics of Parallel Computers: {Intel Paragon} Case Study}, journal = {{IT+TI} Informationstechnik und Technische Informatik}, year = {1995}, month = {April}, volume = {37}, number = {2}, pages = {37--45}, note = {(In German).}, keywords = {parallel computing, performance evaluation, parallel file system, pario-bib}, comment = {In German. They summarize typical performance of the Intel Paragon, including the communication performance and the parallel file-system performance.} } @InProceedings{berry:nasa, author = {Michael R. Berry and Tarek A. El-Ghazawi}, title = {Parallel Input/Output Characteristics of {NASA} Science Applications}, booktitle = {Proceedings of the Tenth International Parallel Processing Symposium}, year = {1996}, month = {April}, pages = {741--747}, publisher = {IEEE Computer Society Press}, address = {Honolulu}, URL = {http://computer.org/proceedings/ipps/7255/7255toc.htm}, keywords = {scientific computation, application, parallel I/O, pario-bib} } @TechReport{bershad:sio-os, author = {Brian Bershad and David Black and David DeWitt and Garth Gibson and Kai Li and Larry Peterson and Marc Snir}, title = {Operating System Support for High-Performance Parallel {I/O} Systems}, year = {1994}, number = {CCSF-40}, institution = {Scalable I/O Initiative}, address = {Caltech Concurrent Supercomputing Facilities, Caltech}, URL = {http://www.cacr.caltech.edu/SIO/pubs/SIO_osfs.ps}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {Four major components: networking, memory servers, file system, and persistent object store. Networking part focuses on low-latency support communication within an application, between applications, and between machines (Bershad and Peterson). Memory servers, shared virtual memory, and checkpointing support (Kai Li). File systems support includes benchmarking, transparent informed prefetching (Gibson), a common interface for PFS and Vesta (Snir), and integrating secondary and tertiary storage systems (including the integration of the National Storage Lab's HPSS (see coyne:hpss) into this project in 1995). OSF/1 (Black) will be extended to support parallel file systems, extent-like behavior, and block coalescing. Persistent object store (DeWitt) is radical change to an object-oriented interface, transparent I/O (though extensible and changable with subclassing, presumably), and heterogeneous support via the Object Definition Language standard. Persistent objects may be integrated with the memory servers and shared virtual memory. See also poole:sio-survey, bagrodia:sio-character, choudhary:sio-language.} } @InProceedings{berson:multimedia, author = {Steven Berson and Leana Golubchik and Richard R. Muntz}, title = {Fault Tolerant Design of Multimedia Servers}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1995}, pages = {364--375}, publisher = {ACM Press}, keywords = {fault tolerance, multimedia, video on demand, parallel I/O, pario-bib} } @InProceedings{best:cmmdio, author = {Michael L. Best and Adam Greenberg and Craig Stanfill and Lewis W. 
Tucker}, title = {{CMMD I/O}: A Parallel {Unix I/O}}, booktitle = {Proceedings of the Seventh International Parallel Processing Symposium}, year = {1993}, pages = {489--495}, publisher = {IEEE Computer Society Press}, address = {Newport Beach, CA}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {Much like Intel CFS, with different I/O modes that determine when the compute nodes synchronize, and the semantics of I/Os written to the file. They found it hard to get good bandwidth for independent I/Os, as opposed to coordinated I/Os; part of this was due to their RAID~3 disk array, but it is more complicated than that. Some performance numbers were given in talk.} } @InProceedings{bestavros:raid, author = {Azer Bestavros}, title = {{IDA}-Based Redundant Arrays of Inexpensive Disks}, booktitle = {Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year = {1991}, month = {December}, pages = {2--9}, keywords = {RAID, disk array, reliability, parallel I/O, pario-bib}, comment = {Uses the Information Dispersal Algorithm (IDA) to generate $n+m$ blocks from $n$ blocks, to tolerate $m$ disk failures; all of the data from the $n$ blocks is hidden in the $n+m$ blocks. Not with the RAID project.} } @InProceedings{bester:gass, author = {Joseph Bester and Ian Foster and Carl Kesselman and Jean Tedesco and Steven Tuecke}, title = {{GASS}: A Data Movement and Access Service for Wide Area Computing Systems}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {78--88}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Tedesco.ps}, keywords = {wide-area network, parallel I/O, pario-bib}, abstract = {In wide area computing, programs frequently execute at sites that are distant from their data. Data access mechanisms are required that place limited functionality demands on an application or host system yet permit high-performance implementations. To address these requirements, we propose a data movement and access service called Global Access to Secondary Storage (GASS). This service defines a global name space via Uniform Resource Locators and allows applications to access remote files via standard I/O interfaces. High performance is achieved by incorporating default data movement strategies that are specialized for I/O patterns common in wide area applications and by providing support for programmer management of data movement. GASS forms part of the Globus toolkit, a set of services for high-performance distributed computing. GASS itself makes use of Globus services for security and communication, and other Globus components use GASS services for executable staging and real-time remote monitoring. Application experiences demonstrate that the library has practical utility.} } @InProceedings{beynon:datacutter, author = {Michael D. 
Beynon and Renato Ferreira and Tahsin Kurc and Alan Sussman and Joel Saltz}, title = {{DataCutter}: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems}, booktitle = {Proceedings of the 2000 Mass Storage Systems Conference}, year = {2000}, month = {March}, pages = {119--133}, publisher = {IEEE Computer Society Press}, address = {College Park, MD}, keywords = {data grid, filter, pario-bib} } @InProceedings{bitton:schedule, author = {Dina Bitton}, title = {Arm Scheduling in Shadowed Disks}, booktitle = {Proceedings of IEEE Compcon}, year = {1989}, month = {Spring}, pages = {132--136}, keywords = {parallel I/O, disk shadowing, reliability, disk mirroring, disk optimization, pario-bib}, comment = {Goes further than bitton:shadow. Uses simulation to verify results from that paper, which were expressions for the expected seek distance of shadowed disks, using shortest-seek-time arm scheduling. Problem is her assumption that arm positions stay independent, in the face of correlating effects like writes, which move all arms to the same place. Simulations match model only barely, and only in some cases. Anyway, shadowed disks can improve performance for workloads more than 60 or 70\% reads.} } @InProceedings{bitton:shadow, author = {D. Bitton and J. Gray}, title = {Disk Shadowing}, booktitle = {Proceedings of the 14th International Conference on Very Large Data Bases}, year = {1988}, pages = {331--338}, keywords = {parallel I/O, disk shadowing, reliability, disk mirroring, disk optimization, pario-bib}, comment = {Also TR UIC EECS 88-1 from Univ of Illinois at Chicago. Shadowed disks are mirroring with more than 2 disks. Writes to all disks, reads from one with shortest seek time. Acknowledges but ignores problem posed by lo:disks. Also considers that newer disk technology does not have linear seek time $(a+bx)$ but rather $(a+b\sqrt{x})$. Shows that with either seek distribution the average seek time for workloads with at least 60\% reads decreases in the number of disks. See also bitton:schedule.} } @InProceedings{bjorstad:structure, author = {P. E. Bj{\o}rstad and J. Cook}, title = {Large Scale Structural Analysis On Massively Parallel Computers}, booktitle = {Linear Algebra for Large Scale and Real-Time Applications}, year = {1993}, pages = {3--11}, publisher = {Kluwer Academic Publishers}, note = {ftp from ftp.ii.uib.no in \verb+pub/tech_reports/mpp_sestra.ps.Z+.}, URL = {file://ftp.ii.uib.no/pub/tech_reports/mpp_sestra.ps.Z}, keywords = {parallel I/O, file access pattern, pario-bib}, comment = {A substantial part of this structural-analysis application was involved in I/O, moving substructures in and out of RAM. The Maspar IO-RAM helped a lot, nearly halving the time required. On the Cray, the SSD had an even bigger impact, perhaps 7--12 times faster. Their main conclusion is that caching helped. 
Most likely this was due to its double-buffering, since they structured the code to read/compute/write in large ``superblocks''.} } @InCollection{blaum:evenodd, author = {Mario Blaum and Jim Brady and Jehoshua Bruck and Jai Menon and Alexander Vardy}, title = {The {EVENODD} Code and its Generalization: An Efficient Scheme for Tolerating Multiple Disk Failures in {RAID} Architectures}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {14}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {187--208}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, URL = {http://www.buyya.com/superstorage/}, keywords = {disk array, RAID, parallel I/O, pario-bib}, comment = {Part of jin:io-book.} } @Article{bonachea:java-io, author = {Dan Bonachea and Phillip Dickens and Rajeev Thakur}, title = {High-Performance File {I/O} in {Java}: Existing Approaches and Bulk {I/O} Extensions}, journal = {Concurrency and Computation: Practice and Experience}, year = {2001}, volume = {13}, number = {8--9}, pages = {713--736}, earlier = {bonachea:java-io-tr}, URL = {http://www.mcs.anl.gov/~thakur/papers/javaio-journal.ps}, keywords = {parallel I/O, Java, file system interface, pario-bib}, abstract = {There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java---many of which are not obvious at first glance---and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.} } @TechReport{bonachea:java-io-tr, author = {Dan Bonachea and Phillip Dickens and Rajeev Thakur}, title = {High-Performance File {I/O} in {Java}: Existing Approaches and Bulk {I/O} Extensions}, year = {2000}, month = {August}, number = {ANL/MCS-P840-0800}, institution = {Mathematics and Computer Science Division, Argonne National Laboratory}, later = {bonachea:java-io}, URL = {http://www.mcs.anl.gov/~thakur/papers/javaio-journal.ps}, keywords = {parallel I/O, java, file system interface, pario-bib}, abstract = {There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java---many of which are not obvious at first glance---and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. 
We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.} } @Article{boral:bubba, author = {Haran Boral and William Alexander and Larry Clay and George Copeland and Scott Danforth and Michael Franklin and Brian Hart and Marc Smith and Patrick Valduriez}, title = {Prototyping {Bubba}, a Highly Parallel Database System}, journal = {IEEE Transactions on Knowledge and Data Engineering}, year = {1990}, month = {March}, volume = {2}, number = {1}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, database, disk caching, pario-bib}, comment = {More recent than copeland:bubba, and a little more general. This gives few details, and doesn't spend much time on the parallel I/O. Bubba does use parallel independent disks, with a significant effort to place data on the disks, and do the work local to the disks, to balance the load and minimize interprocessor communication. Also they use a single-level store (i.e., memory-mapped files) to improve performance of their I/O system, including page locking that is assisted by the MMU. The OS has hooks for the database manager to give memory-management policy hints.} } @InProceedings{boral:critique, author = {H. Boral and D. {DeWitt}}, title = {Database machines: an idea whose time has passed?}, booktitle = {Proceedings of the Second International Workshop on Database Machines}, year = {1983}, pages = {166--187}, publisher = {Springer-Verlag}, keywords = {file access pattern, parallel I/O, database machine, pario-bib}, comment = {Improvements in I/O bandwidth crucial for supporting database machines, otherwise highly parallel DB machines are useless (I/O bound). Two ways to do it: 1) synchronized interleaving by using custom controller and regular disks to read/write same track on all disks, which speeds individual accesses. 2) use very large cache (100-200M) to keep blocks to re-use and to do prefetching. 
But see dewitt:pardbs.} } @InProceedings{bordawekar:collective, author = {Rajesh Bordawekar}, title = {Implementation of Collective {I/O} in the {Intel Paragon} Parallel File System: Initial Experiences}, booktitle = {Proceedings of the 11th ACM International Conference on Supercomputing}, year = {1997}, month = {July}, pages = {20--27}, publisher = {ACM Press}, earlier = {bordawekar:collective-tr}, URL = {http://www.cacr.caltech.edu/~rajesh/ics97.ps}, keywords = {collective I/O, multiprocessor file system, parallel I/O, pario-bib}, comment = {bordawekar:collective was renamed bordawekar:collective-tr, so this could be called bordawekar:collective.} } @TechReport{bordawekar:collective-tr, author = {Rajesh Bordawekar}, title = {Implementation and Evaluation of Collective {I/O} in the {Intel Paragon Parallel File System}}, year = {1996}, month = {November}, number = {CACR~TR-128}, institution = {Center for Advanced Computing Research, California Institute of Technology}, later = {bordawekar:collective}, URL = {http://www.cacr.caltech.edu/~rajesh/collective.html}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {A majority of parallel applications obtain parallelism by partitioning data over multiple processors. Accessing distributed data structures like arrays from files often requires each processor to make a large number of small non-contiguous data requests. This problem can be addressed by replacing small non-contiguous requests by large collective requests. This approach, known as Collective I/O, has been found to work extremely well in practice. In this paper, we describe implementation and evaluation of a collective I/O prototype in a production parallel file system on the Intel Paragon. The prototype is implemented in the PFS subsystem of the Intel Paragon Operating System. We evaluate the collective I/O performance by comparing it with the PFS M_RECORD and M_UNIX I/O modes. It is observed that collective I/O provides significant performance improvement over accesses in M_UNIX mode. However, in many cases, various implementation overheads cause collective I/O to provide lower performance than the M_RECORD I/O mode.}, comment = {This tech report was called bordawekar:collective, then renamed bordawekar:collective-tr, on the appearance of the ICS paper bordawekar:collective.} } @InProceedings{bordawekar:comm, author = {Rajesh Bordawekar and Alok Choudhary}, title = {Communication Strategies for Out-of-core Programs on Distributed Memory Machines}, booktitle = {Proceedings of the 9th ACM International Conference on Supercomputing}, year = {1995}, month = {July}, pages = {395--403}, publisher = {ACM Press}, address = {Barcelona}, earlier = {bordawekar:comm-tr}, keywords = {parallel I/O, inter-processor communication, pario-bib}, comment = {bordawekar:comm-tr is nearly identical in content. Also bordawekar:commstrat is a shorter version.} } @TechReport{bordawekar:comm-tr, author = {Rajesh Bordawekar and Alok Choudhary}, title = {Communication Strategies for Out-of-core Programs on Distributed Memory Machines}, year = {1994}, number = {SCCS-667}, institution = {NPAC, Syracuse University}, later = {bordawekar:comm}, URL = {http://www.npac.syr.edu/pub/by_index/sccs/papers/ps/0660/sccs-0667.ps.Z}, keywords = {parallel I/O, inter-processor communication, pario-bib}, abstract = {In this paper, we show that communication in the out-of-core distributed memory problems requires both inter-processor communication and file I/O.
Given that primary data structures reside in files, even communication requires I/O. Thus, it is important to optimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method, termed as the "out-of-core" communication method, follows a loosely synchronous model. Computation and Communication phases in this case are clearly separated, and communication requires permutation of data in files. The second method, termed as "demand-driven-in-core communication" considers only communication required of each in-core data slab individually. The third method, termed as "producer-driven-in-core communication" goes even one step further and tries to identify the potential (future) use of data while it is in memory. We describe these methods in detail and provide performance results for out-of-core applications; namely, two-dimensional FFT and two-dimensional elliptic solver. Finally, we discuss how "out-of-core" and "in-core" communication methods could be used in virtual memory environments on distributed memory machines.}, comment = {They compare different ways to do global communications in out-of-core applications, involving file I/O and communication at different times. They also comment briefly on how it would work if it depended on virtual memory at each node.} } @InProceedings{bordawekar:commstrat, author = {Rajesh Bordawekar and Alok Choudhary}, title = {Communication strategies for out-of-core programs on distributed memory machines}, booktitle = {Proceedings of the 1995 International Conference on High Performance Computing}, year = {1995}, month = {December}, pages = {130--135}, address = {New Delhi, India}, earlier = {bordawekar:comm}, keywords = {interprocessor communication, parallel I/O, pario-bib}, comment = {Small version of bordawekar:comm.} } @Article{bordawekar:compcomm, author = {Rajesh Bordawekar and Alok Choudhary and J. Ramanujam}, title = {Compilation and Communication Strategies for Out-of-core programs on Distributed-Memory Machines}, journal = {Journal of Parallel and Distributed Computing}, year = {1996}, month = {November}, volume = {38}, number = {2}, pages = {277--288}, publisher = {Academic Press}, earlier = {bordawekar:compcomm-tr}, keywords = {compiler, communication, out-of-core, parallel I/O, inter-processor communication, pario-bib}, abstract = {It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both inter-processor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method called the generalized collective communication method follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method called the receiver-driven in-core communication considers only communication required of each in-core data slab individually. The third method called the owner-driven in-core communication goes even one step further and tries to identify the potential future use of data (by the recipients) while it is in the sender's memory. 
We describe these methods in detail and present a simple heuristic to choose a communication method from among the three methods. We then provide performance results for two out-of-core applications, the two-dimensional FFT code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how the out-of-core and in-core communication methods can be used in virtual memory environments on distributed memory machines.} } @TechReport{bordawekar:compcomm-tr, author = {Rajesh Bordawekar and Alok Choudhary and J. Ramanujam}, title = {Compilation and Communication Strategies for Out-of-core programs on Distributed Memory Machines}, year = {1995}, month = {November}, number = {CACR-113}, institution = {Scalable I/O Initiative, Center of Advanced Computing Research, California Institute of Technology}, later = {bordawekar:compcomm}, URL = {http://www.cat.syr.edu/~rajesh/cacr113.ps}, keywords = {out-of-core, compiler, communication, distributed memory, parallel I/O, pario-bib}, abstract = {It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both inter-processor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method called the generalized collective communication method follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method called the receiver-driven in-core communication considers only communication required of each in-core data slab individually. The third method called the owner-driven in-core communication goes even one step further and tries to identify the potential future use of data (by the recipients) while it is in the sender's memory. We describe these methods in detail and present a simple heuristic to choose a communication method from among the three methods. We then provide performance results for two out-of-core applications, the two-dimensional FFT code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how the out-of-core and in-core communication methods can be used in virtual memory environments on distributed memory machines.}, comment = {See also bordawekar:comm, at ICS'95.} } @InCollection{bordawekar:compiling, author = {Rajesh Bordawekar and Alok Choudhary}, title = {Issues in Compiling {I/O} Intensive Problems}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {3}, editor = {Ravi Jain and John Werth and James C.
Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {69--96}, publisher = {Kluwer Academic Publishers}, keywords = {parallel I/O, compiler, out-of-core, pario-bib}, abstract = {None.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @TechReport{bordawekar:compositional, author = {Rajesh Bordawekar}, title = {A Case for Compositional File Systems (Extended Abstract)}, year = {1998}, month = {March}, number = {CACR TR-161}, institution = {Center of Advanced Computing Research, California Institute of Technology}, URL = {http://www.cacr.caltech.edu/~rajesh/beowulf-fs.html}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {This article presents a case for compositional file systems (CFSs). The CFS is designed using the end-to-end argument; the basic file system attributes, therefore, are independent of the user requirements. The CFS is designed as a functionally compositional, structurally distributed, and dynamically extendable file system. The article also discusses the advantages and implementation alternatives for these file systems, and outlines possible applications.} } @InProceedings{bordawekar:delta-fs, author = {Rajesh Bordawekar and Alok Choudhary and Juan Miguel Del Rosario}, title = {An Experimental Performance Evaluation of {Touchstone Delta Concurrent File System}}, booktitle = {Proceedings of the 7th ACM International Conference on Supercomputing}, year = {1993}, pages = {367--376}, publisher = {ACM Press}, earlier = {bordawekar:delta-fs-TR}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/ics93.ps.Z}, keywords = {performance evaluation, multiprocessor file system, parallel I/O, pario-bib}, abstract = {For a high-performance parallel machine to be a scalable system, it must also have a scalable parallel I/O system. Recently, several commercial machines (e.g. Intel Touchstone Delta, Paragon, CM-5, Ncube-2) have been built that provide features for parallel I/O. However, very little is understood about the performance of these I/O systems. This paper presents an experimental evaluation of the Intel Touchstone Delta's Concurrent File System (CFS). The CFS utilizes the declustering of large files across the disks to improve the I/O performance. Data files can be read or written on the CFS using 4 access modes. \par We present performance measurements for the CFS on the Touchstone Delta with 512 compute nodes and 32 I/O nodes. The study focuses on file read/write rates for various configurations of I/O and compute nodes. The study attempts to show the effect of access modes, buffer sizes and volume restrictions on the system performance. The paper also shows that the performance of the CFS can greatly vary for various data distributions commonly employed in scientific and engineering applications.}, comment = {Some new numbers over bordawekar:delta-fs-TR, but basically the same conclusions.} } @TechReport{bordawekar:delta-fs-TR, author = {Rajesh Bordawekar and Alok Choudhary and Juan Miguel Del Rosario}, title = {An Experimental Performance Evaluation of {Touchstone Delta Concurrent File System}}, year = {1992}, number = {SCCS-420}, institution = {NPAC, Syracuse University}, later = {bordawekar:delta-fs}, keywords = {performance evaluation, multiprocessor file system, parallel I/O, pario-bib}, comment = {Evaluating the Caltech Touchstone Delta (512 nodes, 32 I/O nodes, 64 disks, 8 MB cache per I/O node).
Basic measurements of different access patterns and I/O modes. Location in network doesn't seem to matter. Throughput is often limited by the software; at least, the full hardware throughputs are rarely obtained. Sometimes they are compute-node-limited, and other times they may be limited by the cache management. There must be a way to push the bottleneck back to the disks.} } @TechReport{bordawekar:efficient, author = {Rajesh Bordawekar and Rajeev Thakur and Alok Choudhary}, title = {Efficient Compilation of Out-of-core Data Parallel Programs}, year = {1994}, month = {April}, number = {SCCS-622}, institution = {NPAC}, later = {bordawekar:reorganize}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/access_reorg.ps.Z}, keywords = {parallel I/O, compiler, pario-bib}, abstract = {Large scale scientific applications, such as the Grand Challenge applications, deal with very large quantities of data. The amount of main memory in distributed memory machines is usually not large enough to solve problems of realistic size. This limitation results in the need for system and application software support to provide efficient parallel I/O for out-of-core programs. This paper describes techniques for translating out-of-core programs written in a data parallel language like HPF to message passing node programs with explicit parallel I/O. We describe the basic compilation model and various steps involved in the compilation. The compilation process is explained with the help of an out-of-core matrix multiplication program. We first discuss how an out-of-core program can be translated by extending the method used for translating in-core programs. We then describe how the compiler can optimize the code by estimating the I/O costs associated with different array access patterns and selecting the method with the least I/O cost. This optimization can reduce the amount of I/O by as much as an order of magnitude. Performance results on the Intel Touchstone Delta are presented and analyzed.}, comment = {Revised as bordawekar:reorganize. This is actually fairly different from thakur:runtime. They describe the same basic compiler technique, where arrays are distributed across processors, and each processor has a local array file for holding data from its local partitions. Then the I/O needed for a loop is broken into slabs, where the program proceeds as an alternation of (read slabs, compute, write slabs). The big new thing here is that the compiler tries different ways to form slabs (e.g., by row or by column), estimates the number of I/Os and the amount of data moved for each case, and chooses the case with the smallest amount of I/O.
They also mention how the choice of memory size allocated to different arrays affects the amount of I/O, but give no algorithm other than "try all the possibilities."} } @Article{bordawekar:exemplar, author = {Rajesh Bordawekar and Steven Landherr and Don Capps and Mark Davis}, title = {Experimental Evaluation of the {Hewlett-Packard Exemplar} File System}, journal = {ACM SIGMETRICS Performance Evaluation Review}, year = {1997}, month = {December}, volume = {25}, number = {3}, pages = {21--28}, earlier = {bordawekar:exemplar-tr2}, later = {bordawekar:jexemplar}, URL = {http://doi.acm.org/10.1145/270900.270904}, keywords = {multiprocessor file system, performance evaluation, parallel I/O, pario-bib}, comment = {Part of a special issue on parallel and distributed I/O.} } @TechReport{bordawekar:exemplar-tr2, author = {Rajesh Bordawekar}, title = {Quantitative Characterization and Analysis of the {I/O} Behavior of a Commercial Distributed-shared-memory Machine}, year = {1998}, month = {March}, number = {CACR 157}, institution = {Center of Advanced Computing Research, California Institute of Technology}, later = {bordawekar:exemplar}, URL = {http://www.cacr.caltech.edu/~rajesh/exemplar1.html}, keywords = {parallel I/O, pario-bib, workload characterization, distributed shared memory}, abstract = {This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: (1) To evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks and (2) To provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become non-uniform and the I/O behavior of the entire system, in terms of the scalability and balance, degrades. We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be effectively used to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation and there is a need for mending the traditional memory-oriented design approach to address this problem.} } @TechReport{bordawekar:framework, author = {Rajesh Bordawekar and Alok Choudhary}, title = {A Framework for Representing Data Parallel Programs and its Application in Program Reordering}, year = {1995}, month = {March}, number = {SCCS-698}, institution = {NPAC, Syracuse University}, URL = {http://www.npac.syr.edu/techreports/html/0650/abs-0698.html}, keywords = {data parallel, parallel I/O, pario-bib}, comment = {Although this is mostly a compilers paper, there is a little bit about parallel I/O here.
They comment briefly on how their compiler framework will help them make a compiler that can provide advice to the file system about prefetching and cache replacement, and to decide on the layout of scratch files to optimize locality.} } @TechReport{bordawekar:hpf, author = {Rajesh Bordawekar and Alok Choudhary}, title = {{HPF} with Parallel {I/O} Extensions}, year = {1993}, number = {SCCS-613}, institution = {NPAC, Syracuse University}, URL = {http://www.npac.syr.edu/techreports/ps/0600/sccs-0613.ps.Z}, keywords = {parallel I/O, pario-bib}, comment = {They propose some extensions to HPF to accommodate parallel I/O.} } @TechReport{bordawekar:hpfio, author = {Rajesh Bordawekar and Alok Choudhary}, title = {Extending {I/O} Capabilities of {High Performance Fortran}: Initial Experiences}, year = {1995}, month = {December}, number = {CACR-115}, institution = {Scalable I/O Initiative, Center of Advanced Computing Research, California Institute of Technology}, keywords = {parallel I/O, compiler, FORTRAN, HPF, pario-bib}, abstract = {This report presents implementation details of the prototype PASSION compiler. The PASSION compiler provides support for: (1) Accessing multidimensional in-core arrays and (2) Out-of-core computations. The PASSION compiler takes as input an annotated I/O intensive (either an out-of-core program or program accessing distributed arrays from files) High Performance Fortran (HPF) program. Using hints provided by the user, the compiler modifies the computation so as to minimize the I/O cost and restructures the program to incorporate explicit I/O calls. In this report, compilation of out-of-core FORALL constructs is illustrated using representative programs. Compiler support for accessing distributed in-core data is explained using illustrative examples and supplemented by experimental results.}, comment = {Currently not available on WWW. Describes implementation details of the PASSION Compiler.} } @Article{bordawekar:jexemplar, author = {Rajesh Bordawekar}, title = {Quantitative Characterization and Analysis of the {I/O} Behavior of a Commercial Distributed-shared-memory Machine}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2000}, month = {May}, volume = {11}, number = {5}, pages = {509--526}, earlier = {bordawekar:exemplar}, URL = {http://www.computer.org/tpds/td2000/l0509abs.htm}, keywords = {parallel I/O, pario-bib, workload characterization, distributed shared memory}, abstract = {This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: 1) To evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks, and 2) To provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become nonuniform and the I/O behavior of the entire system, in terms of both scalability and balance, degrades.
\par We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be used effectively to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation, and there is a need for mending the traditional memory-oriented design approach to address this problem.} } @InProceedings{bordawekar:model, author = {Rajesh Bordawekar and Alok Choudhary and Ken Kennedy and Charles Koelbel and Michael Paleczny}, title = {A Model and Compilation Strategy for Out-of-core Data Parallel Programs}, booktitle = {Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, year = {1995}, month = {July}, pages = {1--10}, publisher = {ACM Press}, address = {Santa Barbara, CA}, note = {Also available as the following technical reports: NPAC Technical Report SCCS-0696, CRPC Technical Report CRPC-TR94507-S, SIO Technical Report CACR SIO-104}, earlier = {bordawekar:model-tr}, URL = {http://www.cacr.caltech.edu/techpubs/PAPERS/cacr104.ps}, keywords = {parallel I/O, compiler, pario-bib}, abstract = {It is widely acknowledged in high-performance computing circles that parallel input/output needs substantial improvement in order to make scalable computers truly usable. We present a data storage model that allows processors independent access to their own data and a corresponding compilation strategy that integrates data-parallel computation with data distribution for out-of-core problems. Our results compare several communication methods and I/O optimizations using two out-of-core problems, Jacobi iteration and LU factorization.} } @TechReport{bordawekar:model-tr, author = {Rajesh Bordawekar and Alok Choudhary and Ken Kennedy and Charles Koelbel and Mike Paleczny}, title = {A Model and Compilation Strategy for Out-of-Core Data Parallel Programs}, year = {1994}, month = {December}, number = {CRPC-TR94507-S}, institution = {CRPC}, later = {bordawekar:model}, URL = {gopher://softlib.rice.edu/99/softlib/CRPC-TRs/reports/CRPC-TR94507-S.ps}, keywords = {compilers, parallel I/O, out-of-core applications, pario-bib}, comment = {Basically a summary of their I/O and compilation model for out-of-core compilation of HPF programs. See also paleczny:support.} } @MastersThesis{bordawekar:msthesis, author = {Rajesh R. Bordawekar}, title = {Issues in Software Support for Parallel {I/O}}, year = {1993}, month = {May}, school = {Syracuse University}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/msthesis.ps.Z}, keywords = {parallel I/O, pario-bib}, abstract = {This thesis looks at various issues in providing application-level software support for parallel I/O. We show that the performance of the parallel I/O system varies greatly as a function of data distributions. We present runtime I/O primitives for parallel languages which allow the user to obtain a consistent performance over a wide range of data distributions. \par In order to design these primitives, we study various parameters used in the design of a parallel file system.
We evaluate the performance of the Touchstone Delta Concurrent File System and study the effect of parameters like the number of processors, number of disks, and file size on the system performance. We compute the I/O costs for common data distributions. We propose an alternative strategy, the two-phase data access strategy, to optimize the I/O costs connected with data distributions. We implement runtime primitives using the two-phase access strategy and show that using these primitives not only improves I/O access rates but also allows the user to obtain complex data distributions like block-block and block-cyclic.}, comment = {This is basically a consolidation of the other bordawekar papers, in more detail. So he covers an experimental analysis of the touchstone delta; of the problems arising from the direct-access model for non-conforming distributions; of the two-phase model; and of the run-time library to support two-phase access. See also bordawekar:reorganize, thakur:runtime, bordawekar:efficient, thakur:out-of-core, delrosario:two-phase, bordawekar:primitives, bordawekar:delta-fs.} } @InProceedings{bordawekar:placement, author = {Rajesh Bordawekar and Alok Choudhary and J. Ramanujam}, title = {A Framework for Integrated Communication and {I/O} Placement}, booktitle = {Proceedings of the 2nd International Euro-Par'96, Parallel Processing}, year = {1996}, month = {August}, series = {Lecture Notes in Computer Science}, volume = {1124}, pages = {541--552}, publisher = {Springer-Verlag}, earlier = {bordawekar:placement-tr}, URL = {http://www.cacr.caltech.edu/~rajesh/europar-rajesh.ps}, keywords = {parallel I/O, compiler, pario-bib}, abstract = {This paper describes a framework for analyzing dataflow within an out-of-core parallel program. Dataflow properties of the FORALL statement are analyzed and a unified I/O and communication placement framework is presented. This placement framework can be applied to many problems, which include eliminating redundant I/O incurred in communication. The framework is validated by applying it for optimizing I/O and communication in out-of-core stencil problems. Experimental performance results on an Intel Paragon show significant reduction in I/O and communication overhead.} } @TechReport{bordawekar:placement-tr, author = {Rajesh Bordawekar and Alok Choudhary and J. Ramanujam}, title = {A Framework for Integrated Communication and {I/O} Placement}, year = {1996}, month = {February}, number = {CACR-117}, institution = {Scalable I/O Initiative, Center of Advanced Computing Research, California Institute of Technology}, later = {bordawekar:placement}, URL = {http://www.cacr.caltech.edu/~rajesh/cacr117.ps}, keywords = {parallel I/O, compiler, pario-bib}, abstract = {In this paper, we describe a framework for optimizing communication and I/O costs in out-of-core problems. We focus on communication and I/O optimization within a FORALL construct. We show that existing frameworks do not extend directly to out-of-core problems and can not exploit the FORALL semantics. We present a unified framework for the placement of I/O and communication calls and apply it for optimizing communication for stencil applications.
Using the experimental results, we demonstrate that correct placement of I/O and communication calls can completely eliminate extra file I/O from communication and obtain significant performance improvement.} } @InProceedings{bordawekar:primitives, author = {Rajesh Bordawekar and Juan Miguel {del Rosario} and Alok Choudhary}, title = {Design and Evaluation of Primitives for Parallel {I/O}}, booktitle = {Proceedings of Supercomputing '93}, year = {1993}, pages = {452--461}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/sc93.ps.Z}, keywords = {parallel I/O, pario-bib}, abstract = {In this paper, we show that the performance of parallel file systems can vary greatly as a function of the selected data distributions, and that some data distributions can not be supported. Also, we describe how the parallel language extensions, though simplifying the programming, do not address the performance problems found in parallel file systems. \par We have devised an alternative scheme for conducting parallel I/O - the Two-Phase Access Strategy - which guarantees higher and more consistent performance over a wider spectrum of data distributions. We have designed and implemented runtime primitives that make use of the two-phase access strategy to conduct parallel I/O, and facilitate the programming of parallel I/O operations. We describe these primitives in detail and provide performance results which show that I/O access rates are improved by up to several orders of magnitude. Further, we show that the variation in performance over various data distributions is restricted to within a factor of 2 of the best access rate.}, comment = {Much of this is the same as delrosario:two-phase, except for section~4 where they describe their actual run-time library of primitives, with a little bit about how it works. It's not clear, for example, how their meta-data structures are distributed across the machine. They also do not describe their methods for the data redistribution.} } @TechReport{bordawekar:reorganize, author = {Rajesh Bordawekar and Alok Choudhary and Rajeev Thakur}, title = {Data Access Reorganizations in Compiling Out-of-core Data Parallel Programs on Distributed Memory Machines}, year = {1994}, month = {September}, number = {SCCS-622}, institution = {NPAC}, address = {Syracuse, NY 13244}, earlier = {bordawekar:efficient}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/access_reorg.ps.Z}, keywords = {parallel I/O, compilation, pario-bib}, comment = {Basically they give a case study of out-of-core matrix multiplication to emphasize that the compiler's choice of loop ordering and matrix distribution for in-core matmult is not a very good choice for out-of-core matmult, because it causes too much I/O. By reorganizing the data and the loops, they get much better performance. In this particular case there are known algorithms which they should have used. In general they make the point that the compiler should consider several organizations, and estimate their costs, before generating code. They don't propose anything more sophisticated than to try all the possible organizations.} } @InProceedings{bordawekar:stencil, author = {Rajesh Bordawekar and Alok Choudhary and J. 
Ramanujam}, title = {Automatic Optimization of Communication in Compiling Out-of-core Stencil Codes}, booktitle = {Proceedings of the 10th ACM International Conference on Supercomputing}, year = {1996}, month = {May}, pages = {366--373}, publisher = {ACM Press}, address = {Philadelphia, PA}, earlier = {bordawekar:stencil-tr}, URL = {http://www.cat.syr.edu/~rajesh/ics96.ps}, keywords = {compiler, parallel I/O, pario-bib}, abstract = {In this paper, we describe a technique for optimizing communication for out-of-core distributed memory stencil problems. In these problems, communication may require both inter-processor communication and file I/O. We show that in certain cases, extra file I/O incurred in communication can be completely eliminated by reordering in-core computations. The in-core computation pattern is decided by: (1) how the out-of-core data is distributed into in-core slabs (tiling) and (2) how the slabs are accessed. We show that a compiler using the stencil and processor information can choose the tiling parameters and schedule the tile accesses so that the extra file I/O is eliminated and overall performance is improved.} } @TechReport{bordawekar:stencil-tr, author = {Rajesh Bordawekar and Alok Choudhary and J. Ramanujam}, title = {Automatic Optimization of Communication in Out-of-core Stencil Codes}, year = {1995}, month = {November}, number = {CACR-114}, institution = {Scalable I/O Initiative, Center of Advanced Computing Research, California Institute of Technology}, later = {bordawekar:stencil}, keywords = {compiler, parallel I/O, pario-bib}, abstract = {In this paper, we describe a technique for optimizing communication for out-of-core distributed memory stencil problems. In these problems, communication may require both inter-processor communication and file I/O. We show that in certain cases, extra file I/O incurred in communication can be completely eliminated by reordering in-core computations. The in-core computation pattern is decided by: (1) how the out-of-core data is distributed into in-core slabs (tiling) and (2) how the slabs are accessed. We show that a compiler using the stencil and processor information can choose the tiling parameters and schedule the tile accesses so that the extra file I/O is eliminated and overall performance is improved.} } @InProceedings{bordawekar:support, author = {Rajesh Bordawekar and Alok Choudhary}, title = {Compiler and Runtime Support For Parallel {I/O}}, booktitle = {Proceedings of IFIP Working Conference (WG10.3) on Programming Environments for Massively Parallel Distributed Systems}, year = {1994}, month = {April}, publisher = {Birkhaeuser Verlag AG, Basel, Switzerland}, address = {Monte Verita, Ascona, Switzerland}, keywords = {parallel I/O, pario-bib}, comment = {Contains much of the material from bordawekar:hpf.} } @PhdThesis{bordawekar:thesis, author = {Rajesh Bordawekar}, title = {Techniques for Compiling {I/O} Intensive Parallel Programs}, year = {1996}, month = {April}, school = {Electrical and Computer Engineering Dept., Syracuse University}, note = {Also available as Caltech technical report CACR-118}, URL = {http://www.cat.syr.edu/~rajesh/thesis.html}, keywords = {parallel I/O, compiler, HPF, pario-bib}, abstract = {This dissertation investigates several issues in providing compiler support for I/O intensive parallel programs. In this dissertation, we focus on satisfying two I/O requirements, namely, support for accessing multidimensional arrays and support for {\it out-of-core} computations.
We analyze working spaces in I/O intensive programs and propose three execution models to be used by users or compilers for developing efficient I/O intensive parallel programs. Different phases in compiling out-of-core parallel programs are then described. Three different methods for performing communication are presented and validated using representative application templates. We illustrate that communication in out-of-core programs may require both inter-processor communication and file I/O. We show that using the {\it copy-in-copy-out} semantics of the HPF {\tt FORALL} construct, extra file I/O incurred in communication can be completely eliminated by reordering in-core computations. Two different approaches for reordering in-core computations are presented, namely, an integrated tiling and scheduling heuristic, and a dataflow framework for placing communication and I/O calls. The discussion is supplemented with experimental performance results of representative stencil applications. Finally, an overview of the prototype \textsf{PASSION} (Parallel And Scalable Software for I/O) compiler is presented. This compiler takes an annotated out-of-core High Performance Fortran (HPF) program as input and generates the corresponding {\it node+message-passing} program with calls to the parallel I/O runtime library. We illustrate various functionalities of the compiler using example programs and supplement them by experimental results.} } @InProceedings{bornstein:reshuffle, author = {C. Bornstein and P. Steenkiste}, title = {Data Reshuffling in Support of Fast {I/O} For Distributed-Memory Machines}, booktitle = {Proceedings of the Third IEEE International Symposium on High Performance Distributed Computing}, year = {1994}, month = {August}, pages = {227--235}, keywords = {parallel I/O, distributed memory, pario-bib}, comment = {In a sense, this is about a two-phase technique for network I/O. They consider the problem of feeding a fast network interface (HIPPI) from a distributed-memory parallel machine (iWARP) in which the individual internal links are slower than the external network. So they get the processors to cooperate to reshuffle the data into a canonical layout that is convenient to send to the gateway node, and from there onto the external network.} } @Misc{braam:lustre-arch, author = {Peter J. Braam}, title = {The Lustre Storage Architecture}, year = {2002}, month = {November}, howpublished = {Cluster File Systems Inc. Architecture, design, and manual for Lustre}, note = {http://www.lustre.org/docs/lustre.pdf}, URL = {http://www.lustre.org/docs/lustre.pdf}, keywords = {object-based storage, distributed file system, parallel file system, pario-bib}, comment = {Describes an open-source project to develop an object-based file system for clusters. Related to the NASD project at CMU (http://www.pdl.cs.cmu.edu/NASD/).} } @InProceedings{bradley:ipsc2io, author = {David K. Bradley and Daniel A. Reed}, title = {Performance of the {Intel iPSC/2} Input/Output System}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, pages = {141--144}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {hypercube, parallel I/O, Intel, pario-bib}, comment = {Some measurements and simulations of early CFS performance. Looks terrible, but they disclaim that it is a beta version of the first CFS. They determined that the disks are the bottleneck. But this may just imply that they need more disks.
Their parallel synthetic applications had each process read a separate file. CFS had ridiculous traffic overhead. Again, this was beta CFS.} } @TechReport{brandwijn:dasd, author = {Alexandre Brandwajn}, title = {Performance Benefits of Parallelism in Cached {DASD} Controllers}, year = {1988}, month = {November}, number = {UCSC-CRL-88-30}, institution = {Computer Research Laboratory, UC Santa Cruz}, keywords = {parallel I/O, disk caching, disk architecture, pario-bib}, comment = {Some new DASD products with caches overlap cache hits with prefetch of remainder of track into cache. They use an analytical model to evaluate the performance of these. They find performance improvements of 5-15 percent under their assumptions.} } @InProceedings{brezany:HPF, author = {Peter Brezany and Michael Gerndt and Piyush Mehrotra and Hans Zima}, title = {Concurrent File Operations in a {High Performance FORTRAN}}, booktitle = {Proceedings of Supercomputing '92}, year = {1992}, pages = {230--237}, keywords = {supercomputing, fortran, multiprocessor file system interface, pario-bib}, comment = {Describing their way of writing arrays to files so that they are written in a fast, parallel way, and so that (if read in same distribution) they can be read fast and parallel. Normal read and write forces standard ordering, but cread and cwrite use a compiler- and runtime-selected ordering, which is stored in the file so it can be used when rereading. Good for temp files.} } @InProceedings{brezany:HPF2, author = {Peter Brezany and Jonghyun Lee and Marianne Winslett}, title = {Parallel {I/O} Support for HPF on Computational Grids}, booktitle = {Proceedings of the Fourth International Symposium on High Performance Computing}, year = {2002}, month = {May}, series = {Lecture Notes in Computer Science}, volume = {2327}, pages = {539--550}, publisher = {Springer-Verlag}, URL = {http://link.springer.de/link/service/series/0558/bibs/2327/23270539.htm}, keywords = {parallel I/O, Fortran, HPF, data-parallel, computational grid, pario-bib}, abstract = {Recently several projects have started to implement large-scale high-performance computing on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual supercomputer". One of the great challenges for this environment is to provide appropriate high-level programming models. High Performance Fortran (HPF) is a language of choice for development of data parallel components of Grid applications. Another challenge is to provide efficient access to data that is distributed across local and remote Grid resources. In this paper, constructs to specify parallel input and output (I/O) operations on multidimensional arrays on the Grid in the context of HPF are proposed. The paper also presents implementation concepts that are based on the HPF compiler VFC, the parallel I/O runtime system Panda, Internet, and Grid technologies. Preliminary experimental performance results are discussed in the context of a real application example.} } @InProceedings{brezany:architecture, author = {Peter Brezany and Thomas A.
Mueck and Erich Schikuta}, title = {A Software Architecture for Massively Parallel Input-Output}, booktitle = {Third International Workshop PARA'96 (Applied Parallel Computing - Industrial Computation and Optimization)}, year = {1996}, month = {August}, series = {Lecture Notes in Computer Science}, volume = {1186}, pages = {85--96}, publisher = {Springer-Verlag}, address = {Lyngby, Denmark}, note = {Also available as Technical Report of the Inst. f.~Angewandte Informatik u. Informationssysteme, University of Vienna, TR~96202}, URL = {http://www.pri.univie.ac.at/~schiki/research/paper/para96/para96.ps}, keywords = {compiler transformations, runtime support, parallel I/O, prefetching, pario-bib}, abstract = {For an increasing number of data intensive scientific applications, parallel I/O concepts are a major performance issue. Tackling this issue, we provide an outline of an input/output system designed for highly efficient, scalable and conveniently usable parallel I/O on distributed memory systems. The main focus of this paper is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of High Performance FORTRAN efforts. Specifically, our design is presented in the context of the Vienna Fortran Compilation System.} } @InProceedings{brezany:compiling, author = {Peter Brezany and Thomas A. Mueck and Erich Schikuta}, title = {Mass Storage Support for a Parallelizing Compilation System}, booktitle = {International Conference Eurosim'96-- HPCN challenges in Telecomp and Telecom: Parallel Simulation of Complex Systems and Large Scale Applications}, year = {1996}, month = {June}, pages = {63--70}, publisher = {North-Holland, Elsevier Science}, address = {Delft, The Netherlands}, URL = {http://www.pri.univie.ac.at/~schiki/research/paper/eurosim96/eurosim96.ps}, keywords = {parallel I/O, high performance mass storage system, high performance languages, compilation techniques, data administration, pario-bib} } @InProceedings{brezany:io-support, author = {Peter Brezany and Thomas A. Mueck and Erich Schikuta}, title = {Language, Compiler and Parallel Database Support for {I/O} Intensive Applications}, booktitle = {Proceedings of the International Conference on High Performance Computing and Networking}, year = {1995}, month = {May}, series = {Lecture Notes in Computer Science}, volume = {919}, pages = {14--20}, publisher = {Springer-Verlag}, address = {Milan, Italy}, note = {also available as Technical Report of the Inst. f.~Software Technology and Parallel Systems, University of Vienna, TR95-8, 1995}, URL = {http://www.pri.univie.ac.at/~schiki/research/paper/techrep/tr95-8.ps}, keywords = {compiler transformations, runtime support, declustering, parallel I/O, pario-bib}, comment = {They describe some extensions to Vienna Fortran that support parallel I/O, and how they plan to extend the compiler and run-time system to help. They are somewhat short on details, however. The basic idea is that file declustering is based on hints from the compiler or programmer about how the file will be used, eg, as a matrix distributed in thus-and-so way.} } @Article{brezany:irregular, author = {P. Brezany and A. Choudhary and M. 
Dang}, title = {Parallelization of irregular out-of-core applications for distributed-memory systems}, journal = {High-Performance Computing and Networking}, year = {1997}, series = {Lecture Notes in Computer Science}, volume = {1225}, pages = {811--820}, publisher = {Springer-Verlag}, earlier = {brezany:irregular-tr}, keywords = {parallel I/O, out of core, compiler, library, pario-bib}, abstract = {Large scale irregular applications involve data arrays and other data structures that are too large to fit in main memory and hence reside on disks; such applications are called out-of-core applications. This paper presents techniques for implementing this kind of application. In particular we present a design for a runtime system to efficiently support parallel execution of irregular out-of-core codes on distributed-memory systems. Furthermore, we describe the appropriate program transformations required to reduce the I/O overheads for staging data as well as for communication while maintaining load balance. The proposed techniques can be used by a parallelizing compiler or by users writing programs in node + message passing style. We have done a preliminary implementation of the techniques presented here. We introduce experimental results from a template CFD code to demonstrate the efficacy of the presented techniques.}, comment = {The authors present techniques for implementing large scale irregular out-of-core applications. The techniques they describe can be used by a parallel compiler (e.g., HPF and its extensions) or by users using message passing. The objectives of the proposed techniques are ``to minimize I/O accesses in all steps while maintaining load balance and minimal communication''. They demonstrate the effectiveness of their techniques by showing results from a Computational Fluid Dynamics (CFD) code.} } @TechReport{brezany:irregular-tr, author = {P. Brezany and A. Choudhary}, title = {Techniques and Optimizations for Developing Irregular Out-of-Core Applications on Distributed-Memory Systems}, year = {1996}, month = {November}, number = {96-4}, institution = {Institute for Software Technology and Parallel Systems, University of Vienna}, URL = {http://www.pri.univie.ac.at/~schiki/research/vipios/paper/brezany-choudhary.ps}, keywords = {parallel I/O, out of core, irregular applications, compiler, pario-bib} } @InProceedings{brezany:technology, author = {Peter Brezany and Marianne Winslett and Denis A. Nicole and Toni Cortes}, title = {Parallel {I/O} and Storage Technology}, booktitle = {Proceedings of the Seventh International Euro-Par Conference}, year = {2001}, month = {August}, series = {Lecture Notes in Computer Science}, volume = {2150}, pages = {887--888}, publisher = {Springer-Verlag}, address = {Manchester, UK}, URL = {http://link.springer.de/link/service/series/0558/bibs/2150/21500887.htm}, keywords = {pario-bib, parallel I/O}, abstract = {Input and output (I/O) is a major performance bottleneck for large-scale scientific applications running on parallel platforms. For example, it is not uncommon that performance of carefully tuned parallel programs can slow dramatically when they read or write files. This is because many parallel applications need to access large amounts of data, and although great advances have been made in the CPU and communication performance of parallel machines, similar advances have not been made in their I/O performance.
The densities and capacities of disks have increased significantly, but improvement in performance of individual disks has not followed the same pace. For parallel computers to be truly usable for solving real, large-scale problems, the I/O performance must be scalable and balanced with respect to the CPU and communication performance of the system. Parallel I/O techniques can help to solve this problem by creating multiple data paths between memory and disks. However, simply adding disk drives to an I/O system without considering the overall software design will improve performance only marginally.} } @InProceedings{broom:acacia, author = {Bradley M. Broom}, title = {A Synchronous File Server for Distributed File Systems}, booktitle = {Proceedings of the 16th Australian Computer Science Conference}, year = {1993}, earlier = {broom:acacia-tr}, keywords = {distributed file system, pario-bib}, comment = {See broom:acacia-tr. See also broom:impl, lautenbach:pfs, mutisya:cache, and broom:cap.} } @TechReport{broom:acacia-tr, author = {Bradley M. Broom}, title = {A Synchronous File Server for Distributed File Systems}, year = {1992}, month = {August}, number = {TR--CS--92--12}, institution = {Dept. of Computer Science, Australian National University}, later = {broom:acacia}, keywords = {distributed file system, pario-bib}, comment = {This paper is not specifically about parallel I/O, but the file system will be used in the AP-1000 multiprocessor. Acacia is a file server that is optimized for synchronous writes, like those used in stateless protocols (eg, NFS). It writes inodes in blocks in any free location that is close to the current head position, using indirect inode blocks to track those. Indirect blocks are in turn written anywhere convenient, and their positions are tracked by the superblock. There is one slot in each cylinder reserved for the superblock, which is timestamped. They get good performance but claim to need a better implementation, and a faster allocation algorithm. No indication of effect on read performance.} } @InProceedings{broom:cap, author = {Bradley M. Broom and Robert Cohen}, title = {Acacia: A Distributed, Parallel File System for the {CAP-II}}, booktitle = {Proceedings of the First Fujitsu-ANU CAP Workshop}, year = {1990}, month = {November}, keywords = {distributed file system, multiprocessor file system, pario-bib}, comment = {See also broom:acacia, broom:impl, lautenbach:pfs, and mutisya:cache. This describes the semantic model for their file system. Modelled a lot after Amoeba, they have capabilities that represent immutable files. There are create, destroy, read, and write operations, but the read and write can affect only part of the file, if desired. They also have an atomic ``copy'' operation, which creates a snapshot of the current state of the file. They also have ``spawn'' and ``merge'' operations, which are essentially begin and end a transaction, a set of changes that are atomically merged into the file later. These seem to be addressing issues of concurrency more than of parallelism. They also discuss implementation somewhat, mentioning the use of distributed caches and log-structured disk layout. Prototype in Linda (!).} } @InProceedings{broom:impl, author = {Bradley M. 
Broom}, title = {Implementation and Performance of the {Acacia} File System}, booktitle = {Proceedings of the Second Fujitsu-ANU CAP Workshop}, year = {1991}, month = {November}, keywords = {distributed file system, multiprocessor file system, pario-bib}, comment = {See also broom:acacia, lautenbach:pfs, mutisya:cache, and broom:cap. This paper is a very sketchy overview of those; it is better to read them.} } @InProceedings{broom:kelpio, author = {Bradley Broom and Rob Fowler and Ken Kennedy}, title = {{KelpIO}: A telescope-ready domain-specific {I/O} library for irregular block-structured applications}, booktitle = {Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2001}, month = {May}, pages = {148--155}, publisher = {IEEE Computer Society Press}, address = {Brisbane, Australia}, URL = {http://ieeexplore.ieee.org/iel5/7358/19961/00923187.pdf}, keywords = {parallel I/O, domain-specific I/O library, scientific computing, astronomy, pario-bib}, abstract = {To ameliorate the need to spend significant programmer time modifying parallel programs to achieve high-performance, while maintaining compact, comprehensible source codes, the paper advocates the use of telescoping language technology to automatically apply, during the normal compilation process, high-level performance enhancing transformations to applications using a high-level domain-specific I/O library. We believe that this approach will be more acceptable to application developers than new language extensions, but will be just as amenable to optimization by advanced compilers, effectively making it a domain-specific language extension for I/O. The paper describes a domain-specific I/O library for irregular block-structured applications based on the KeLP library, describes high-level transformations of the library primitives for improving performance, and describes how a high-level domain-specific optimizer for applying these transformations could be constructed using the telescoping languages framework.} } @InProceedings{broom:perf, author = {Bradley M. Broom}, title = {Performance Measurement of the {Acacia} Parallel File System for the {AP1000} Multicomputer}, booktitle = {Proc. Second Parallel Computing Workshop}, year = {1993}, month = {November}, pages = {{P1-F-1} to {P1-F-11}}, publisher = {Fujitsu Parallel Computing Research Facilities, Fujitsu Laboratories Ltd.}, address = {Kawasaki, Japan}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {They evaluate the performance of Acacia with some simple synthetic benchmarks. Performance limited by implementation problems in the sequential file system. Otherwise no real surprises.} } @InProceedings{brown:benchmarks, author = {Aaron Brown and David A. Patterson}, title = {Towards Availability Benchmarks: A Case Study of Software {RAID} System}, booktitle = {Proceedings of the 2000 USENIX Technical Conference}, year = {2000}, pages = {263--276}, publisher = {USENIX Association}, URL = {http://www.usenix.org/publications/library/proceedings/usenix2000/general/brown.html}, keywords = {RAID, disk array, parallel I/O, pario-bib}, abstract = {Benchmarks have historically played a key role in guiding the progress of computer science systems research and development, but have traditionally neglected the areas of availability, maintainability, and evolutionary growth, areas that have recently become critically important in high-end system design.
As a first step in addressing this deficiency, we introduce a general methodology for benchmarking the availability of computer systems. Our methodology uses fault injection to provoke situations where availability may be compromised, leverages existing performance benchmarks for workload generation and data collection, and can produce results in both detail-rich graphical presentations or in distilled numerical summaries. We apply the methodology to measure the availability of the software RAID systems shipped with Linux, Solaris 7 Server, and Windows 2000 Server, and find that the methodology is powerful enough not only to quantify the impact of various failure conditions on the availability of these systems, but also to unearth their design philosophies with respect to transient errors and recovery policy.} } @InProceedings{browne:io-arch, author = {J. C. Browne and A. G. Dale and C. Leung and R. Jenevein}, title = {A Parallel Multi-Stage {I/O} Architecture with Self-managing Disk Cache for Database Management Applications}, booktitle = {Proceedings of the Fourth International Workshop on Database Machines}, year = {1985}, month = {March}, publisher = {Springer-Verlag}, keywords = {parallel I/O, disk caching, database, pario-bib}, comment = {A fancy interconnection from procs to I/O processors, intended mostly for DB applications, that uses cache at I/O end and a switch with smarts. Cache is associative. Switch helps out in sort and join operations.} } @TechReport{bruce:chimp, author = {R. A. A. Bruce and S. R. Chapple and N. B. MacDonald and A. S. Trew}, title = {{CHIMP} and {PUL}: Support for Portable Parallel Programming}, year = {1993}, month = {March}, number = {EPCC-TR93-07}, institution = {Edinburgh Parallel Computing Center}, URL = {file://ftp.epcc.ed.ac.uk/pub/pul/chimp-pul-overview.ps}, keywords = {parallel programming, parallel I/O, pario-bib}, comment = {An overview of the CHIMP message-passing library and the PUL set of libraries. Key design goal is portability; they run on many systems. PUL includes PUL-GF, which supports parallel access to files (see chapple:pulgf, chapple:pulgf-adv, and chapple:pario). Other PUL libraries support grids and meshes, global communications, and task farms. Contact pul@epcc.ed.ac.uk.} } @InProceedings{brunet:factor, author = {Jean-Philippe Brunet and Palle Pedersen and S.~Lennart Johnsson}, title = {Load-Balanced {LU} and {QR} Factor and Solve Routines for Scalable Processors with Scalable {I/O}}, booktitle = {Proceedings of the 17th IMACS World Congress}, year = {1994}, month = {July}, address = {Atlanta, GA}, note = {Also available as Harvard University Computer Science Technical Report TR-20-94.}, URL = {ftp://das-ftp.harvard.edu/techreports/tr-20-94.ps.gz}, keywords = {parallel I/O, linear algebra, out-of-core, pario-bib}, abstract = {The concept of block-cyclic order elimination can be applied to out-of-core $LU$ and $QR$ matrix factorizations on distributed memory architectures equipped with a parallel I/O system. This elimination scheme provides load balanced computation in both the factor and solve phases and further optimizes the use of the network bandwidth to perform I/O operations. Stability of LU factorization is enforced by full column pivoting. Performance results are presented for the Connection Machine system CM-5.}, comment = {Short, not many details. 
Performance results show about 3.5 Gflops for all problem sizes, both in-core on small N and out-of-core on large N.} } @Article{cabrera:pario, author = {Luis-Felipe Cabrera and Darrell D. E. Long}, title = {Swift: {Using} Distributed Disk Striping to Provide High {I/O} Data Rates}, journal = {Computing Systems}, year = {1991}, month = {Fall}, volume = {4}, number = {4}, pages = {405--436}, earlier = {cabrera:pariotr}, keywords = {parallel I/O, disk striping, distributed file system, pario-bib}, comment = {See cabrera:swift, cabrera:swift2. Describes the performance of a Swift prototype and simulation results. They stripe data over multiple disk servers (here SPARC SLC with local disk), and access it from a SPARC2 client. Their prototype gets nearly linear speedup for reads and asynchronous writes; synchronous writes are slower. They hit the limit of the Ethernet and/or the client processor with three disk servers. Adding another Ethernet allowed them to go higher. Simulation shows good scaling. Seems like a smarter implementation would help, as would special-purpose parity-computation hardware. Good arguments for use of PID instead of RAID, to avoid a centralized controller that is both a bottleneck and a single point of failure.} } @TechReport{cabrera:pariotr, author = {Luis-Felipe Cabrera and Darrell D. E. Long}, title = {Swift: {Using} Distributed Disk Striping to Provide High {I/O} Data Rates}, year = {1991}, number = {CRL-91-46}, institution = {UC Santa Cruz}, later = {cabrera:pario}, URL = {ftp://ftp.cse.ucsc.edu/pub/tr/ucsc-crl-91-46.ps.Z}, keywords = {parallel I/O, disk striping, distributed file system, pario-bib} } @TechReport{cabrera:stripe, author = {Luis-Felipe Cabrera and Darrell D. E. Long}, title = {Using Data Striping in a Local Area Network}, year = {1992}, month = {March}, number = {UCSC-CRL-92-09}, institution = {Univ. California at Santa Cruz}, keywords = {striping, parallel I/O, distributed system, pario-bib}, comment = {See cabrera:swift2, cabrera:swift, cabrera:pario. Not much new here. Simulates higher-performance architectures. Shows reasonable scalability. Counts 5 inst/byte for parity computation.} } @TechReport{cabrera:swift, author = {Luis-Felipe Cabrera and Darrell D. E. Long}, title = {Swift: A Storage Architecture for Large Objects}, year = {1990}, number = {UCSC-CRL-89-04}, institution = {U.C. Santa Cruz}, later = {cabrera:swift2}, URL = {ftp://ftp.cse.ucsc.edu/pub/tr/ucsc-crl-89-04.tar.Z}, keywords = {parallel I/O, disk striping, distributed file system, multimedia, pario-bib}, comment = {See cabrera:swift2. A brief outline of a design for a high-performance storage system, designed for storing and retrieving large objects like color video or visualization data at very high speed. They distribute data over several ``storage agents'', which are some form of disk or RAID. They are all connected by a high-speed network. A ``storage manager'' decides where to spread each file, what kind of reliability mechanism is used. User provides preallocation info such as size, reliability level, data rate requirements, and so forth.} } @InProceedings{cabrera:swift2, author = {Luis-Felipe Cabrera and Darrell D. E. Long}, title = {Exploiting Multiple {I/O} Streams to Provide High Data-Rates}, booktitle = {Proceedings of the 1991 Summer USENIX Technical Conference}, year = {1991}, pages = {31--48}, earlier = {cabrera:swift}, keywords = {parallel I/O, disk striping, distributed file system, multimedia, pario-bib}, comment = {See also cabrera:swift.
More detail than the other paper. Experimental results from a prototype that stripes files across a distributed file system. Gets almost linear speedup in certain cases. Much better than NFS. Simulation to extend it to larger systems.} } @InProceedings{calderon:implement, author = {Alejandro Calder\'on and F\'elix Garc{\'\i}a and Jes\'us Carretero and Jose M. P\'erez and Javier Fern\'andez}, title = {An Implementation of {MPI-IO} on {Expand}: A Parallel File System Based on {NFS} Servers}, booktitle = {Recent Advances in Parallel Virtual Machine and Message Passing Interface}, year = {2002}, series = {Lecture Notes in Computer Science}, volume = {2474}, pages = {306--313}, publisher = {Springer-Verlag}, URL = {http://link.springer.de/link/service/series/0558/bibs/2474/24740306.htm}, URLpdf = {http://link.springer.de/link/service/series/0558/papers/2474/24740306.pdf}, keywords = {parallel I/O, multiprocessor file system, NFS, pario-bib}, abstract = {This paper describes an implementation of MPI-IO using a new parallel file system, called Expand (Expandable Parallel File System), that is based on NFS servers. Expand combines multiple NFS servers to create a distributed partition where files are declustered. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and the NFS protocol. The paper describes the design, the implementation and the evaluation of Expand with MPI-IO. This evaluation has been made in Linux clusters and compares Expand and PVFS.} } @Article{cannataro:data-intensive, author = {Mario Cannataro and Domenico Talia and Pradip K. Srimani}, title = {Parallel data intensive computing in scientific and commercial applications}, journal = {Parallel Computing}, year = {2002}, month = {May}, volume = {28}, number = {5}, pages = {673--704}, publisher = {Elsevier Science}, URL = {http://www.elsevier.com/gej-ng/10/35/21/60/57/28/abstract.html}, keywords = {parallel application, parallel I/O, pario-bib}, abstract = {Applications that explore, query, analyze, visualize, and, in general, process very large scale data sets are known as Data Intensive Applications. Large scale data intensive computing plays an increasingly important role in many scientific activities and commercial applications, whether it involves data mining of commercial transactions, experimental data analysis and visualization, or intensive simulation such as climate modeling. By combining high performance computation, very large data storage, high bandwidth access, and high-speed local and wide area networking, data intensive computing enhances the technical capabilities and usefulness of most systems. The integration of parallel and distributed computational environments will produce major improvements in performance for both computing intensive and data intensive applications in the future.
The purpose of this introductory article is to provide an overview of the main issues in parallel data intensive computing in scientific and commercial applications and to encourage the reader to go into the more in-depth articles later in this special issue.} } @Article{cao:jtickertaip, author = {Pei Cao and Swee Boon Lim and Shivakumar Venkataraman and John Wilkes}, title = {The {TickerTAIP} parallel {RAID} architecture}, journal = {ACM Transactions on Computer Systems}, year = {1994}, month = {August}, volume = {12}, number = {3}, pages = {236--269}, publisher = {ACM Press}, earlier = {cao:tickertaip}, keywords = {parallel I/O, RAID, pario-bib}, comment = {See cao:tickertaip-tr2.} } @InProceedings{cao:tickertaip, author = {Pei Cao and Swee Boon Lim and Shivakumar Venkataraman and John Wilkes}, title = {The {TickerTAIP} parallel {RAID} architecture}, booktitle = {Proceedings of the 20th Annual International Symposium on Computer Architecture}, year = {1993}, pages = {52--63}, earlier = {cao:tickertaip-tr2}, later = {cao:jtickertaip}, keywords = {parallel I/O, RAID, pario-bib}, comment = {Superceded by cao:tickertaip-tr2 and cao:jtickertaip.} } @TechReport{cao:tickertaip-tr, author = {Pei Cao and Swee Boon Lim and Shivakumar Venkataraman and John Wilkes}, title = {The {TickerTAIP} parallel {RAID} architecture}, year = {1992}, month = {December}, number = {HPL-92-151}, institution = {HP Labs}, later = {cao:tickertaip-tr2}, keywords = {parallel I/O, RAID, pario-bib}, comment = {A parallelized RAID architecture that distributes the RAID controller operations across several worker nodes. Multiple hosts can connect to different workers, allowing multiple paths into the array. The workers then communicate on their own fast interconnect to accomplish the requests, distributing parity computations across multiple workers. They get much better performance and reliability than plain RAID. They built a prototype and a performance simulator. Two-phase commit was needed for request atomicity, and a request sequencer was needed for serialization. Also found it was good to give the whole request info to all workers and to let them figure out what to do and when. Superceded by cao:tickertaip-tr2 and cao:tickertaip.} } @TechReport{cao:tickertaip-tr2, author = {Pei Cao and Swee Boon Lim and Shivakumar Venkataraman and John Wilkes}, title = {The {TickerTAIP} parallel {RAID} architecture}, year = {1993}, month = {April}, number = {HPL-93-25}, institution = {HP Labs}, earlier = {cao:tickertaip-tr}, later = {cao:tickertaip}, keywords = {parallel I/O, RAID, pario-bib}, comment = {Revised version of cao:tickertaip, actually: ``It's the ISCA paper with some text edits plus some new results on what happens if you turn disk request-scheduling on. It's been sent to TOCS.''. Thus it supercedes both cao:tickertaip-tr and cao:tickertaip. Eventually published as cao:jtickertaip.} } @Article{carballeira:adaptive, author = {Felix Garcia-Carballeira and Jesus Carretero and Alejandro Calderon and Jose M. Perez and Jose D. 
Garcia}, title = {An adaptive cache coherence protocol specification for parallel input/output systems}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2004}, month = {June}, volume = {15}, number = {6}, pages = {533--545}, institution = {Univ Carlos III Madrid, Comp Architecture Grp, Madrid 28911, Spain}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://csdl.computer.org/comp/trans/td/2004/06/l0533abs.htm}, keywords = {parallel file system, caching, cache coherence, adaptive caching, protocol specification, pario-bib}, abstract = {Caching has been intensively used in memory and traditional file systems to improve system performance. However, the use of caching in parallel file systems and I/O libraries has been limited to I/O nodes to avoid cache coherence problems. In this paper, we specify an adaptive cache coherence protocol very suitable for parallel file systems and parallel I/O libraries. This model exploits the use of caching, both at processing and I/O nodes, providing performance increase mechanisms such as aggressive prefetching and delayed-write techniques. The cache coherence problem is solved by using a dynamic scheme of cache coherence protocols with different sizes and shapes of granularity. The proposed model is very appropriate for parallel I/O interfaces, such as MPI-IO. Performance results, obtained on an IBM SP2, are presented to demonstrate the advantages offered by the cache management methods proposed.} } @InProceedings{carey:shore, author = {Michael J. Carey and David J. DeWitt and Michael J. Franklin and Nancy E. Hall and Mark L. McAuliffe and Jeffrey F. Naughton and Daniel T. Schuh and Marvin H. Solomon and C. K. Tan and Odysseas G. Tsatalos and Seth J. White and Michael J. Zwilling}, title = {Shoring Up Persistent Applications}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1994}, pages = {383--394}, publisher = {ACM Press}, keywords = {persistent systems, database, parallel I/O, object-oriented, pario-bib}, comment = {SHORE is a persistent object database system. It is intended for parallel or distributed systems, and attempts to combine both DB and file system features. Everything in the database is a typed object, in that there is a registered interface object that defines this type, including the basic data types of elements of the object, and methods that manipulate the object. Every object has an OID, and objects can refer to other objects with the OID. But they also support a unix-like namespace, in which the names refer to objects by giving the OID. They also have a unix-compatibility library that provides access to many objects through the unix file interface. Every node has a SHORE server, and applications talk to their local server for all their needs. The local server talks to other servers as needed. The servers are also responsible for caching pages and managing locks and transactions.} } @InProceedings{carns:pvfs, author = {Philip H. Carns and Walter B. {Ligon III} and Robert B.
Ross and Rajeev Thakur}, title = {{PVFS}: A Parallel File System for Linux Clusters}, booktitle = {Proceedings of the 4th Annual Linux Showcase and Conference}, year = {2000}, month = {October}, pages = {317--327}, publisher = {USENIX Association}, address = {Atlanta, GA}, URL = {http://www.mcs.anl.gov/~thakur/papers/pvfs.ps}, keywords = {parallel I/O, parallel file system, cluster file system, Linux, pario-bib}, abstract = {As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters. \par In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both for a concurrent read/write workload and for the BTIO benchmark. We compare the I/O performance when using a Myrinet network versus a fast-ethernet network for I/O-related communication in PVFS. We obtained read and write bandwidths as high as 700~Mbytes/sec with Myrinet and 225~Mbytes/sec with fast ethernet.}, comment = {Won the Best Paper Award.} } @TechReport{carretero:case, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {Implementation of a Parallel File System: {CCFS} a Case of Study}, year = {1994}, number = {FIM/84.1/DATSI/94}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/datsi84.1.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {This document briefly describes the components of the Cache Coherent File System (CCFS) source code. CCFS has three main components: Client File Server (CLFS), Local File Server (LFS), Concurrent Disk System (CDS). The main modules and functions of each component are described here. Special emphasis has been put on interfaces and data structures.}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @InProceedings{carretero:compassion, author = {J. Carretero and J. No and S.-S. Park and A. Choudhary and P. Chen}, title = {{COMPASSION}: a parallel {I/O} runtime system including chunking and compression for irregular applications}, booktitle = {Proceedings of the International Conference on High-Performance Computing and Networking}, year = {1998}, month = {April}, pages = {668--677}, later = {carretero:compassion2}, keywords = {PASSION, parallel I/O, compression, collective I/O, two-phase I/O, performance evaluation, pario-bib}, abstract = {We present two designs, namely, "collective I/O" and "pipelined collective I/O", of a runtime library for irregular applications based on the two-phase collective I/O technique. We also present the optimization of both models by using chunking and compression mechanisms.
In the first scheme, all processors participate in compressions and I/O at the same time, making scheduling of I/O requests simpler but creating a possibility of contention at the I/O nodes. In the second approach, processors are grouped into several groups, overlapping communication, compression, and I/O to reduce I/O contention dynamically. Finally, evaluation results are shown that demonstrate that we can obtain significantly higher I/O performance than has been possible so far.} } @InProceedings{carretero:compassion2, author = {J. Carretero and Jaechun No and A. Choudhary and Pang Chen}, title = {{COMPASSION}: a parallel {I/O} runtime system including chunking and compression for irregular applications}, booktitle = {Proceedings of the Fifth International Symposium on Solving Irregularly Structured Problems in Parallel (IRREGULAR'98)}, year = {1998}, month = {August}, pages = {262--273}, earlier = {carretero:compassion}, keywords = {PASSION, parallel I/O, compression, collective I/O, pario-bib}, abstract = {In this paper we present an experimental evaluation of COMPASSION, a runtime system for irregular applications based on collective I/O techniques. It provides a "Collective I/O" model, enhanced with "Pipelined" operations and compression. All processors participate in the I/O simultaneously, alone or grouped, making scheduling of I/O requests simpler and providing support for contention management. In-memory compression mechanisms reduce the total execution time by diminishing the amount of I/O requested and the I/O contention. Our experiments, executed on an Intel Paragon and on the ASCI/Red teraflops machine, demonstrate that COMPASSION can obtain significantly higher I/O performance than has been possible so far.} } @TechReport{carretero:concepts, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {Multicomputer Parallel File Systems Design Concepts: {CCFS} a case of study}, year = {1994}, number = {FIM/79.1/DATSI/94}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/datsi79.1.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @Article{carretero:evaluation, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {A Multiprocessor Parallel Disk System Evaluation}, journal = {Decentralized and Distributed Systems}, year = {1993}, month = {September}, publisher = {North Holland}, note = {IFIP Transactions A-39}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/first_esprit.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {This paper presents a Parallel Disk System (PDS) for general purpose multiprocessors, which provides support for conventional file systems and databases, as well as direct access for applications requiring high performance mass storage. We present a systematic method to characterize a parallel I/O system, using it to evaluate PDS and to identify an optimal PDS configuration. Several devices (single disk, Raid3 and Raid5), and different configurations of I/O nodes, each one with a different type of device, have been simulated.
Throughput and I/O rate of each configuration have been obtained for the former configurations and different types of workloads (database, general purpose and scientific applications).}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @TechReport{carretero:lfs, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {{LFS} Design: A Parallel File Server for Multicomputers}, year = {1994}, number = {FIM/81.1/DATSI/94}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/datsi81.1.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {This document describes the detailed design of the LFS, one of the components of the Cache Coherent File System (CCFS). CCFS has three main components: Client File Server (CLFS), Local File Server (LFS), Concurrent Disk System (CDS). The Local File Servers are located on each disk node, to provide file server functions on a per-node basis. The LFS will interact with the Concurrent Disk System (CDS) to execute real input/output and to manage the disk system, partitions, distributed partitions, etc. The LFS includes general file system services and specialized services, and it will be responsible for maintaining cache consistency, distributing accesses to other servers, controlling partition information, etc.}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @InProceedings{carretero:mapping, author = {J. Carretero and F. P\'{e}rez and P. {de Miguel} and F. Garc\'{\i}a and L. Alonso}, title = {{I/O} Data Mapping in {{\em ParFiSys:}} Support for High-Performance {I/O} in Parallel and Distributed Systems}, booktitle = {Euro-Par~'96}, year = {1996}, month = {August}, series = {Lecture Notes in Computer Science}, volume = {1123}, pages = {522--526}, publisher = {Springer-Verlag}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/europar96.ps.Z}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {This paper gives an overview of the I/O data mapping mechanisms of {\em ParFiSys}. Grouped management and parallelization are presented as relevant features. I/O data mapping mechanisms of {\em ParFiSys}, including all levels of the hierarchy, are described in this paper.} } @Article{carretero:parfisys, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {{ParFiSys}: A Parallel File System for {MPP}}, journal = {ACM Operating Systems Review}, year = {1996}, month = {April}, volume = {30}, number = {2}, pages = {74--80}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @Article{carretero:performance, author = {J. Carretero and F. P\'{e}rez and P. {de Miguel} and F. Garc\'{\i}a and L. Alonso}, title = {Performance increase mechanisms for parallel and distributed file systems}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {525--542}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel I/O, multiprocessor file system, pario-bib} } @TechReport{carretero:posix, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L.
Alonso}, title = {Prototype {POSIX}-Style Parallel File Server and Report for the {CS-2}}, year = {1993}, number = {D1.7/1}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/first_esprit.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @TechReport{carretero:posix-final, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {{POSIX}-Style Parallel File Server for the {GPMIMD}: Final Report}, year = {1995}, number = {D1.7/2}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/final_esprit.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @Article{carretero:subsystem, author = {J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {A Massively Parallel and Distributed {I/O} Subsystem}, journal = {Computer Architecture News}, year = {1996}, month = {June}, volume = {24}, number = {3}, pages = {1--8}, keywords = {parallel I/O, I/O architecture, pario-bib}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @TechReport{carter:benchmark, author = {Russell Carter and Bob Ciotti and Sam Fineberg and Bill Nitzberg}, title = {{NHT-1} {I/O} Benchmarks}, year = {1992}, month = {November}, number = {RND-92-016}, institution = {NAS Systems Division, NASA Ames}, URL = {http://www.nas.nasa.gov/NAS/TechReports/RNDreports/RND-92-016/RND-92-016.html}, keywords = {parallel I/O, benchmark, pario-bib}, comment = {Specs for three scalable-I/O benchmarks to be used for evaluating I/O for multiprocessors. One measures application I/O by mixing I/O and computation, one measures max disk I/O by reading and writing 80\% of the total RAM memory, and the last one is for sending that data from the file system, through the network, and back. See fineberg:nht1.} } @TechReport{carter:vesta, author = {Matthew P. Carter and David Kotz}, title = {An Implementation of the {Vesta} Parallel File System {API} on the {Galley} Parallel File System}, year = {1998}, month = {April}, number = {PCS-TR98-329}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/160/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/carter:vesta.pdf}, keywords = {parallel I/O, multiprocessor file system, pario-bib, dfk}, abstract = {To demonstrate the flexibility of the Galley parallel file system and to analyze the efficiency and flexibility of the Vesta parallel file system interface, we implemented Vesta's application-programming interface on top of Galley. We implemented the Vesta interface using Galley's file-access methods, whose design arose from extensive testing and characterization of the I/O requirements of scientific applications for high-performance multiprocessors. We used a parallel CPU, parallel I/O, out-of-core matrix-multiplication application to test the Vesta interface in both its ability to specify data access patterns and in its run-time efficiency. In spite of its powerful ability to specify the distribution of regular, non-overlapping data access patterns across disks, we found that the Vesta interface has some significant limitations. 
We discuss these limitations in detail in the paper, along with the performance results.}, comment = {See also http://www.cs.dartmouth.edu/~dfk/nils/galley.html} } @InProceedings{catania:array, author = {V. Catania and A. Puliafito and S. Riccobene and L. Vita}, title = {Performance Evaluation of a Partial Dynamic Declustering Disk Array System}, booktitle = {Proceedings of the Third IEEE International Symposium on High Performance Distributed Computing}, year = {1994}, month = {August}, pages = {244--252}, keywords = {parallel I/O, disk array, pario-bib}, abstract = {With a view to improving the performance and the fault tolerance of mass storage units, this paper concentrates on the architectural issues of parallelizing I/O access in a disk array system by means of definition of a new, particularly flexible architecture, called Partial Dynamic Declustering, which is fault-tolerant and offers higher levels of performance and reliability than the solutions normally used. A fast distributed algorithm based on a dynamic structure and usable for the implementation of an efficient I/O subsystem manager is proposed. Particular attention is also paid to the definition of analytical models based on Stochastic Reward Petri nets in order to analyze the performance and reliability of the system proposed.} } @Article{catania:disk-array, author = {V. Catania and A. Puliafito and S. Riccobene and L. Vita}, title = {Design and Performance Analysis of a Disk Array System}, journal = {IEEE Transactions on Computers}, year = {1995}, month = {October}, volume = {44}, number = {10}, pages = {1236--1247}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, disk array, pario-bib}, abstract = {We concentrate on the architectural issues of parallelizing I/O access in a disk array system by means of definition of a new, particularly flexible architecture, called partial dynamic declustering, which is fault-tolerant and offers higher levels of performance and reliability than the solutions normally used. A simulation analysis highlights the efficiency of the proposed solution in balancing the file system workload and demonstrates its validity in both cases of unbalanced loads and expansion of the system. Particular attention is also paid to the definition of analytical models, based on stochastic reward nets, in order to analyze the performance and reliability of the system. The response time distribution function is evaluated and a specific performance analysis with varying degrees of declustering and workload is carried out.} } @Article{catania:mass, author = {V. Catania and A. Puliafito and S. Riccobene and L. Vita}, title = {An {I/O} subsystem supporting mass storage functions in parallel systems}, journal = {Computer Standards \& Interfaces}, year = {1996}, volume = {18}, number = {2}, pages = {117--138}, keywords = {parallel I/O, pario-bib}, abstract = {The introduction of multiprocessor architectures into computer systems has further increased the gap between processing times and access times to mass memories, thus making the processes more and more I/O-bound. To provide higher performance levels (both transfer rate and I/O rate), disk array technology is based on the use of a number of logically interconnected disks of a small size, in order to replace disks which have a large capacity but are very expensive.
With a view to improving the performance and fault tolerance of the mass storage units, this paper concentrates on the architectural issues of parallelizing I/O access in a disk array system by means of definition of a new, particularly flexible architecture, called Partial Dynamic Declustering, which is fault-tolerant and offers higher levels of performance and reliability than the solutions normally used. A fast distributed algorithm based on a dynamic structure and usable for the implementation of an efficient I/O subsystem manager is proposed and evaluated by a simulative analysis. A specific study also characterizes the system's performance with varying degrees of declustering and workload types (from the transactional to the scientific type). The results obtained allow us to obtain the optimal configuration of the system (number of disks per group) which will ensure the desired response time values for varying workloads.} } @Article{cecchet:raidb, author = {Emmanuel Cecchet and Julie Marguerite and Willy Zwaenepoel}, title = {Partial replication: Achieving scalability in redundant arrays of inexpensive databases}, journal = {Lecture Notes in Computer Science}, booktitle = {7th International Conference on Principles of Distributed Systems (OPODIS 2003); December 10-13, 2003; MARTINIQUE}, editor = {Papatriantafilou, M; Hunel, P}, year = {2004}, month = {July}, volume = {3144}, pages = {58--70}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/link.asp?id=kay13m7clgg75utk}, keywords = {replication strategies, RAIDb, database, pario-bib}, abstract = {Clusters of workstations become more and more popular to power data server applications such as large scale Web sites or e-Commerce applications. There has been much research on scaling the front tiers (web servers and application servers) using clusters, but databases usually remain on large dedicated SMP machines. In this paper, we focus on the database tier using clusters of commodity hardware. Our approach consists of studying different replication strategies to achieve various degrees of performance and fault tolerance. Redundant Array of Inexpensive Databases (RAIDb) is to databases what RAID is to disks. In this paper, we focus on RAIDb-1 that offers full replication and RAIDb-2 that introduces partial replication, in which the user can define the degree of replication of each database table. We present a Java implementation of RAIDb called Clustered JDBC or C-JDBC. C-JDBC achieves both database performance scalability and high availability at the middleware level without changing existing applications. We show, using the TPC-W benchmark, that partial replication (RAIDb-2) can offer better performance scalability (up to 25\%) than full replication by allowing fine-grain control on replication.
Distributing and restricting the replication of frequently written tables to a small set of backends reduces I/O usage and improves CPU utilization of each cluster node.} } @InProceedings{cerin:sorting, author = {Christophe C\'erin and Hazem Fkaier and Mohamed Jemni}, title = {A Synthesis of Parallel Out-of-core Sorting Programs on Heterogeneous Clusters}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {78--85}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190078abs.htm}, keywords = {out-of-core, sorting, parallel I/O, load balancing, data distribution, pario-app, pario-bib}, abstract = {The paper considers the problem of parallel external sorting in the context of a form of heterogeneous clusters. We introduce two algorithms and we compare them to another one that we have previously developed. Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory. This is the case for cluster computing platforms which are made of standard, cheap and scarce components. For that class of computing resources a good use of I/O operations compatible with the requirements of load balancing and computational complexity is the key to success. We explore three techniques and show how they can be deployed for clusters with processor performances related by a multiplicative factor. We validate the approaches by showing experimental results for the load balancing factor.} } @Article{ceron:dna, author = {C. Ceron and J. Dopazo and E. L. Zapata and J.M. Carazo and O. Trelles}, title = {Parallel implementation of {DNAml} program on message-passing architectures}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {24}, number = {5--6}, pages = {701--716}, publisher = {Elsevier Science}, URL = {http://dx.doi.org/10.1016/S0167-8191(98)00002-7}, keywords = {parallel computers, run-time analysis, phylogenetic trees, DNAml program, source code, parallel I/O, pario-bib}, abstract = {We present a new computing approach for the parallelization on message-passing computer architectures of the DNAml algorithm, one of the most powerful tools available for constructing phylogenetic trees from DNA sequences. An analysis of the data dependencies of the method gave little chance to develop an efficient parallel approach. However, a careful run-time analysis of the behaviour of the algorithm allowed us to propose a very efficient parallel implementation based on the combination of advanced dynamic scheduling strategies, speculative running-time execution decisions and I/O buffering. In this work, we discuss specific Parallel Virtual Machine (PVM)-based implementations for a cluster of workstations and for Distributed Memory multiprocessors, with high performance results. The code can be obtained from our public-domain sites.}, comment = {They discuss the parallelization on message-passing computers of the {DNA}ml algorithm, a tool used to construct phylogenetic trees from {DNA} sequences. By performing a run-time analysis of the behavior of the algorithm they came up with an efficient parallel implementation based on dynamic scheduling strategies, speculative run-time execution decisions and I/O buffering. They use I/O buffering (prefetching) to fetch tasks that need to be processed.
The parallel code was written in C using PVM for message passing and is available via anonymous ftp at ftp.ac.uma.es.} } @Misc{cfs:lustre, key = {CFS}, title = {Lustre: A Scalable, High-Performance File System}, year = {2002}, month = {November}, howpublished = {Cluster File Systems Inc. white paper, version 1.0}, note = {http://www.lustre.org/docs/whitepaper.pdf}, URL = {http://www.lustre.org/docs/whitepaper.pdf}, keywords = {object-based storage, distributed file system, parallel file system, pario-bib}, comment = {Describes an open-source project to develop an object-based file system for clusters. Related to the NASD project at CMU (http://www.pdl.cs.cmu.edu/NASD/).} } @Article{cha:subgroup, author = {Kwangho Cha and Taeyoung Hong and Jeongwoo Hong}, title = {The subgroup method for collective {I/O}}, journal = {Lecture Notes in Computer Science}, booktitle = {5th International Conference on Parallel and Distributed Computing; December 8-10, 2004; Singapore, SINGAPORE}, editor = {Liew, KM; Shen, H; See, S; Cai, W; Fan, P; Horiguchi, S}, year = {2004}, month = {December}, volume = {3320}, pages = {301--304}, institution = {Korea Inst Sci \& Technol Informat, Supercomp Ctr, 52 Eoeun, Taejon 305806, South Korea; Korea Inst Sci \& Technol Informat, Supercomp Ctr, Taejon 305806, South Korea}, publisher = {SPRINGER-VERLAG BERLIN}, copyright = {(c)2005 The Thomson Corporation}, URL = {http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=3320&spage=301}, keywords = {collective I/O, MPI subgroup, pario-bib}, abstract = {Because many scientific applications require large data processing, the importance of parallel I/O has been increasingly recognized. For collective I/O, one of the considerable features of parallel I/O, we suggest the subgroup method. It is the way of using collective I/O of MPI effectively in terms of application programs. From the experimental results, we could conclude that the subgroup method for collective I/O is more efficient than plain collective I/O.} } @InProceedings{chandy:array, author = {John A. Chandy and Prithviraj Banerjee}, title = {Reliability Evaluation of Disk Array Architectures}, booktitle = {Proceedings of the 1993 International Conference on Parallel Processing}, year = {1993}, pages = {I--263--267}, publisher = {CRC Press}, address = {St. Charles, IL}, keywords = {parallel I/O, disk array, pario-bib, RAID}, comment = {A framework for evaluating the reliability of RAIDs. They consider failure and repair rates that depend on the workload.} } @InProceedings{chang:reuse, author = {Tai-Sheng Chang and Sangyup Shim and David H.~C. Du}, title = {The Scalability of Spatial Reuse Based Serial Storage Interfaces}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {93--101}, publisher = {ACM Press}, address = {San Jose, CA}, URL = {http://doi.acm.org/10.1145/266220.266229}, keywords = {I/O interface, I/O network, I/O architecture, parallel I/O, pario-bib}, abstract = {Due to the growing popularity of emerging applications such as digital libraries, Video-On Demand, distance learning, and Internet World-Wide Web, multimedia servers with a large capacity and high performance storage subsystem are in high demand. Serial storage interfaces are emerging technologies designed to improve the performance of such storage subsystems. They provide high bandwidth, fault tolerance, fair bandwidth sharing and long distance connection capability.
All of these issues are critical in designing a scalable and high performance storage subsystem. Some of the serial storage interfaces provide the spatial reuse feature which allows multiple concurrent transmissions. That is, multiple hosts can access disks concurrently with full link bandwidth if their access paths are disjoint. Spatial reuse provides a way to build a storage subsystem whose aggregate bandwidth may be scaled up with the number of hosts. However, it is not clear how much the performance of a storage subsystem could be improved by the spatial reuse with different configurations and traffic scenarios. Both the limitations and the capabilities of this scalability need to be investigated. To understand their fundamental performance characteristics, we derive an analytic model for the serial storage interfaces with the spatial reuse feature. Based on this model, we investigate the maximum aggregate throughput from different system configurations and load distributions. We show how the number of disks needed to saturate a loop varies with different numbers of hosts and different load scenarios. We also show how load balancing by uniformly distributing the load to all the disks on a loop may incur high overhead. This is because the accesses to far away disks need to go through many links and consume the bandwidth of each link they go through. The results show the achievable throughput may be reduced by more than half in some cases.} } @InProceedings{chang:titan, author = {Chialin Chang and Bongki Moon and Anurag Acharya and Carter Shock and Alan Sussman and Joel Saltz}, title = {{Titan}: a High-Performance Remote-sensing Database}, booktitle = {Proceedings of the Thirteenth International Conference on Data Engineering}, year = {1997}, month = {April}, address = {Birmingham, U.K.}, URL = {ftp://hpsl.cs.umd.edu/pub/papers/icde97-final.ps.Z}, keywords = {parallel databases, satellite imagery, remote sensing, parallel I/O, pario-bib}, abstract = {There are two major challenges for a high-performance remote-sensing database. First, it must provide low-latency retrieval of very large volumes of spatio-temporal data. This requires effective declustering and placement of a multi-dimensional dataset onto a large disk farm. Second, the order of magnitude reduction in data-size due to post-processing makes it imperative, from a performance perspective, that the postprocessing be done on the machine that holds the data. This requires careful coordination of computation and data retrieval. This paper describes the design, implementation and evaluation of {\em Titan}, a parallel shared-nothing database designed for handling remote-sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24~GB of AVHRR data from the NOAA-7 satellite.
The experimental results show that Titan provides good performance for global queries and interactive response times for local queries.} } @TechReport{chao:datamesh, author = {Chia Chao and Robert English and David Jacobson and Bart Sears and Alexander Stepanov and John Wilkes}, title = {{DataMesh} architecture 1.0}, year = {1992}, month = {December}, number = {HPL-92-153}, institution = {HP Labs}, earlier = {wilkes:datamesh}, URL = {http://www.hpl.hp.com/personal/John_Wilkes/papers/HPL-92-153.ps.Z}, keywords = {parallel I/O, parallel file system, pario-bib}, comment = {A more detailed spec of the datamesh architecture, specifying components and operations. It is a block server where blocks are associatively addressed by tags. Some search operations are supported, as are atomic tag-changing operations. See also cao:tickertaip, wilkes:datamesh1, wilkes:datamesh, wilkes:houses, wilkes:lessons.} } @Manual{chapple:pario, author = {S. R. Chapple and R. A. Fletcher}, title = {{PUL-GF} Parallel {I/O} Concepts}, year = {1993}, month = {February}, organization = {Edinburgh Parallel Computing Center}, note = {EPCC-KTP-PUL-GF-PROT-CONC 1.0}, URL = {file://ftp.epcc.ed.ac.uk/pub/pul/concepts-i.ps}, keywords = {parallel I/O, pario-bib}, comment = {See also bruce:chimp, chapple:pulgf, and chapple:pulgf-adv, for general information on CHIMP and PUL-GF. This document is an exploration of the potential ways to parallelize the underlying I/O support for the PUL-GF interface. They reason about tradeoffs in the number of servers, disks, and clients, but (as they note) without any performance evaluation to back it up. In particular, they argue that there should be one partition per disk, one server per disk, and probably one client to many servers, or many clients to many servers. A key assumption is that a traditional serial file system is the home location for files, and that files are ``converted'' into parallel files (or vice versa) by replicating or distributing them. Application could choose the number of servers (and hence disks) for each file. Hints could be provided about many things. Interesting idea to allow user hooks for cache prefetch and writeback functions. Support for variable-length records (``atoms'') is a key component. Segments of a file with different formats, e.g., a header and a matrix, may be separated into different components when the file is distributed into parallel form. See chapple:pulpf for info on the eventual realization of these ideas.} } @Manual{chapple:pulgf, author = {S. R. Chapple and S. M. Trewin}, title = {{PUL-GF} Prototype User Guide}, year = {1993}, month = {February}, organization = {Edinburgh Parallel Computing Center}, note = {EPCC-KTP-PUL-GF-UG 0.1}, URL = {file://ftp.epcc.ed.ac.uk/pub/pul/gf-prot-ug.ps}, keywords = {parallel I/O, pario-bib}, comment = {PUL is a set of libraries that run on top of the CHIMP portable message-passing library (see bruce:chimp). One of the PUL libraries is PUL-GF, to support file I/O. The underlying I/O support is not parallel (but see chapple:pario). The interface is parallel, however; in particular, it supports file modes similar to those used in many systems, which they call single, multi, random, and independent. Formatted and unformatted, synchronous and asynchronous. Very general multidimensional-array read and write functions. Ability to group multiple I/O requests into atomic units, though not a full transaction capability. See also chapple:pulgf-adv and chapple:pario.} } @Manual{chapple:pulgf-adv, author = {S. R. 
Chapple}, title = {{PUL-GF} Prototype Advanced User Guide}, year = {1993}, month = {January}, organization = {Edinburgh Parallel Computing Center}, note = {EPCC-KTP-PUL-GF-PROT-ADV-UG 0.1}, URL = {file://ftp.epcc.ed.ac.uk/pub/pul/gf-prot-adv-ug.ps.Z}, keywords = {parallel I/O, pario-bib}, comment = {See chapple:pulgf for a definition of PUL-GF. This document describes the internal client-server interface to PUL-GF, including ways that users can extend the functionality of PUL-GF. In particular, they give an example of how a new file format (a run-length encoded 2-d matrix) can be read and written transparently as if it were a plain matrix file. The extensibility is offered by run-time registration of user-defined interposition functions, to be called at key moments in the processing of a file I/O request. See also bruce:chimp and chapple:pario.} } @Manual{chapple:pulpf, author = {S. R. Chapple}, title = {{PUL-PF} Reference Manual}, year = {1994}, month = {January}, organization = {Edinburgh Parallel Computing Center}, note = {EPCC-KTP-PUL-PF-PROT-RM 1.1}, URL = {file://ftp.epcc.ed.ac.uk/pub/pul/pf-prot-rm.ps}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {See also chapple:pulgf and chapple:pario. An evolution of their parallel I/O interface. PUL-PF is a library on top of existing file systems. Every process is either a client or a server; servers write some portion of the file to a file in the file system. Servers can be divided into groups so that files need not be spread across all servers. There seems to be client caching, with consistency controlled differently depending on access mode; when necessary, the application must call get-token and send-token commands to serialize access to an atom. Independently of their single, multi, random, and independent mode, they can read or write the next, previous, current, or ``wild'' atom (wild means the next ``most available'' atom not yet read by this process). Most I/O is on atoms, but particles (pieces of atoms) can also be independently read and written. Hints are supported to specify access pattern (random or sequential, stride), file partitioning, mapping, atom size, or caching. In many of those cases it goes beyond a hint to the supply of a user-defined function, e.g., for cache-replacement algorithm.} } @TechReport{chaudhry:relaxing, author = {Geeta Chaudhry and Thomas H. Cormen}, title = {Relaxing the Problem-Size Bound for Out-of-Core Columnsort}, year = {2003}, month = {April}, number = {TR2003-445}, institution = {Dept. of Computer Science, Dartmouth College}, address = {Hanover, NH}, URL = {ftp://ftp.cs.dartmouth.edu/TR/TR2003-445.ps.Z}, keywords = {parallel I/O, sorting, out-of-core applications, pario-bib}, abstract = {Previous implementations of out-of-core columnsort limit the problem size to $N \leq \sqrt{(M/P)^3 / 2}$, where $N$ is the number of records to sort, $P$ is the number of processors, and $M$ is the total number of records that the entire system can hold in its memory (so that $M/P$ is the number of records that a single processor can hold in its memory). We implemented two variations to out-of-core columnsort that relax this restriction. Subblock columnsort is based on an algorithmic modification of the underlying columnsort algorithm, and it improves the problem-size bound to $N \leq (M/P)^{5/3} / 4^{2/3}$ but at the cost of additional disk I/O\. 
$M$-columnsort changes the notion of the column size in columnsort.} } @PhdThesis{chaudhry:thesis, author = {Geeta Chaudhry}, title = {Parallel Out-of-Core Sorting: The Third Way}, year = {2004}, month = {September}, number = {TR2004-517}, institution = {Dartmouth College, Computer Science}, school = {Dartmouth College}, address = {Hanover, NH}, note = {Available as Dartmouth Technical Report TR2004-517}, URL = {https://digitalcommons.dartmouth.edu/dissertations/7/}, keywords = {out-of-core sorting, columnsort, cluster computing, parallel I/O, pario-bib}, abstract = {Sorting very large datasets is a key subroutine in almost any application that is built on top of a large database. Two ways to sort out-of-core data dominate the literature: merging-based algorithms and partitioning-based algorithms. Within these two paradigms, all the programs that sort out-of-core data on a cluster rely on assumptions about the input distribution. We propose a third way of out-of-core sorting: oblivious algorithms. In all, we have developed six programs that sort out-of-core data on a cluster. The first three programs, based completely on Leighton's columnsort algorithm, have a restriction on the maximum problem size that they can sort. The other three programs relax this restriction; two are based on our original algorithmic extensions to columnsort. We present experimental results to show that our algorithms perform well. To the best of our knowledge, the programs presented in this thesis are the first to sort out-of-core data on a cluster without making any simplifying assumptions about the distribution of the data to be sorted.}, comment = {Doctoral dissertation. Advisor: Thomas H. Cormen} } @TechReport{chaudhry:tricks, author = {Geeta Chaudhry and Elizabeth A. Hamon and Thomas H. Cormen}, title = {Stupid Columnsort Tricks}, year = {2003}, month = {April}, number = {TR2003-444}, institution = {Dept. of Computer Science, Dartmouth College}, address = {Hanover, NH}, URL = {ftp://ftp.cs.dartmouth.edu/TR/TR2003-444.ps.Z}, keywords = {parallel I/O, sorting, out-of-core applications, pario-bib}, abstract = {Leighton's columnsort algorithm sorts on an $r \times s$ mesh, subject to the restrictions that $s$ is a divisor of~$r$ and that $r \geq 2s^2$ (so that the mesh is tall and thin). We show how to mitigate both of these restrictions. One result is that the requirement that $s$ is a divisor of~$r$ is unnecessary; columnsort sorts correctly whether or not $s$ divides~$r$. We present two algorithms that, as long as $s$ is a perfect square, relax the restriction that $r \geq 2s^2$; both reduce the exponent of~$s$ to~$3/2$. One algorithm requires $r \geq 4s^{3/2}$ if $s$ divides~$r$ and $r \geq 6s^{3/2}$ if $s$ does not divide~$r$. The other algorithm requires $r \geq 4s^{3/2}$, and it requires $s$ to be a divisor of~$r$. Both algorithms have applications in increasing the maximum problem size in out-of-core sorting programs.} } @InProceedings{chehadeh:oodb, author = {Y.~C. Chehadeh and A.~R. Hurson and L.~L. Miller and S. Pakzad and B.~N.
Jamoussi}, title = {Application of parallel disks for efficient handling of object-oriented databases}, booktitle = {Proceedings of the 1993 IEEE Symposium on Parallel and Distributed Processing}, year = {1993}, pages = {184--191}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, disk array, object oriented database, pario-bib}, abstract = {In today's workstation based environment, applications such as design databases, multimedia databases, and knowledge bases do not fit well into the relational data processing framework. The object-oriented data model has been proposed to model and process such complex databases. Due to the nature of the supported applications, object-oriented database systems need efficient mechanisms for the retrieval of complex objects and the navigation along the semantic links among objects. Object clustering and buffering have been suggested as efficient mechanisms for the retrieval of complex objects. However, to improve the efficiency of the aforementioned operations, one has to look at the recent advances in storage technology. This paper is an attempt to investigate the feasibility of using parallel disks for object-oriented databases. It analyzes the conceptual changes needed to map the clustering and buffering schemes proposed on the new underlying architecture. The simulation and performance evaluation of the proposed leveled-clustering and mapping schemes utilizing parallel I/O disks are presented and analyzed.} } @InProceedings{chen:automatic, author = {Ying Chen and Marianne Winslett and Y. Cho and S. Kuo}, title = {Automatic Parallel {I/O} Performance Optimization Using Genetic Algorithms}, booktitle = {Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing}, year = {1998}, month = {July}, pages = {155--162}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/proceedings/hpdc/8579/85790155abs.htm}, keywords = {parallel I/O, performance optimization, genetic algorithm, pario-bib}, abstract = {The complexity of parallel I/O systems imposes a significant challenge in managing and utilizing the available system resources to meet application performance, portability and usability goals. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. The approach is based on a high level description of the target workload and execution environment characteristics, and applies genetic algorithms to select high quality I/O plans. We have validated this approach in the Panda parallel I/O library.
Our performance evaluations on the IBM SP show that this approach can select high quality I/O plans under a variety of system conditions with a low overhead, and the genetic algorithm-selected I/O plans are in general better than the default plans used in Panda.} } @InProceedings{chen:collective, author = {Ying Chen and Ian Foster and Jarek Nieplocha and Marianne Winslett}, title = {Optimizing Collective {I/O} Performance on Parallel Computers: A Multisystem Study}, booktitle = {Proceedings of the 11th ACM International Conference on Supercomputing}, year = {1997}, month = {July}, pages = {28--35}, publisher = {ACM Press}, URL = {http://www.acm.org/pubs/articles/proceedings/supercomputing/263580/p28-chen/p28-chen.pdf}, keywords = {collective I/O, multiprocessor file system, parallel I/O, pario-bib} } @InProceedings{chen:eval, author = {Peter Chen and Garth Gibson and Randy Katz and David Patterson}, title = {An Evaluation of Redundant Arrays of Disks using an {Amdahl 5890}}, booktitle = {Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1990}, month = {May}, pages = {74--85}, keywords = {parallel I/O, RAID, disk array, pario-bib}, comment = {An experimental validation of the performance predictions of patterson:raid, plus some extensions. Confirms that RAID level 5 (rotated parity) is best for large read/writes, and RAID level 1 (mirroring) is best for small reads/writes.} } @InProceedings{chen:maxraid, author = {Peter M. Chen and David A. Patterson}, title = {Maximizing Performance in a Striped Disk Array}, booktitle = {Proceedings of the 17th Annual International Symposium on Computer Architecture}, year = {1990}, pages = {322--331}, keywords = {parallel I/O, RAID, disk striping, pario-bib}, comment = {Choosing the optimal striping unit, i.e., size of contiguous data on each disk (bit, byte, block, etc.). A small striping unit is good for low-concurrency workloads since it increases the parallelism applied to each request, but a large striping unit can support high-concurrency workloads where each independent request depends on fewer disks. They do simulations to find throughput, and thus to pick the striping unit. They find equations for the best compromise striping unit based on the concurrency and the disk parameters, or on the disk parameters alone. Some key assumptions may limit applicability, but this is not addressed.} } @InProceedings{chen:panda, author = {Y. Chen and M. Winslett and K. E. Seamons and S. Kuo and Y. Cho and M. Subramaniam}, title = {Scalable Message Passing in {Panda}}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {109--121}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, parallel file system, pario-bib}, abstract = {To provide high performance for applications with a wide variety of i/o requirements and to support many different parallel platforms, the design of a parallel i/o system must provide for efficient utilization of available bandwidth both for disk traffic and for message passing. In this paper we discuss the message-passing scalability of the server-directed i/o architecture of Panda, a library for synchronized i/o of multidimensional arrays on parallel platforms.
We show how to improve i/o performance in situations where message-passing is a bottleneck, by combining the server-directed i/o strategy for highly efficient use of available disk bandwidth with new mechanisms to minimize internal communication and computation overhead in Panda. We present experimental results that show that with these improvements, Panda will provide high i/o performance for a wider range of applications, such as applications running with slow interconnects, applications performing i/o operations on large numbers of arrays, or applications that require drastic data rearrangements as data are moved between memory and disk (e.g., array transposition). We also argue that in the future, the improved approach to message-passing will allow Panda to support applications that are not closely synchronized or that run in heterogeneous environments.}, comment = {See seamons:panda. This paper goes further with some communication improvements.} } @InProceedings{chen:panda-automatic, author = {Y. Chen and M. Winslett and Y. Cho and S. Kuo}, title = {Automatic parallel {I/O} performance optimization in {Panda}}, booktitle = {Proceedings of the Eleventh Symposium on Parallel Algorithms and Architectures}, year = {1998}, pages = {108--118}, URL = {http://doi.acm.org/10.1145/277651.277677}, keywords = {parallel I/O, Panda, portability, pario-bib}, abstract = {Parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of applications has become a significant challenge. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects optimal I/O plans automatically without human intervention, based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses rule-based and randomized search-based algorithms to select optimal parameter settings in Panda, a parallel I/O library for multidimensional arrays. Our performance results obtained from two IBM SPs with significantly different configurations show that the Panda optimizer is able to select high-quality I/O plans and deliver high performance under a variety of system configurations.} } @InProceedings{chen:panda-model, author = {Y. Chen and M. Winslett and S. Kuo and Y. Cho and M. Subramaniam and K. E. Seamons}, title = {Performance Modeling for the {Panda} Array {I/O} Library}, booktitle = {Proceedings of Supercomputing '96}, year = {1996}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, URL = {http://www.supercomp.org/sc96/proceedings/SC96PROC/YING/INDEX.HTM}, keywords = {performance modeling, parallel I/O, pario-bib}, abstract = {We present an analytical performance model for Panda, a library for synchronized i/o of large multidimensional arrays on parallel and sequential platforms, and show how the Panda developers use this model to evaluate Panda's parallel i/o performance and guide future Panda development.
The model validation shows that system developers can simplify performance analysis, identify potential performance bottlenecks, and study the design trade-offs for Panda on massively parallel platforms more easily than by conducting empirical experiments. More importantly, we show that the outputs of the performance model can be used to help make optimal plans for handling application i/o requests, the first step toward our long-term goal of automatically optimizing i/o request handling in Panda.}, comment = {On Web and CDROM only. They derive a detailed but fairly simple model of the Panda 2.0.5 parallel I/O library, by carefully enumerating the costs involved in a collective I/O operation. They measure Panda, AIX, and MPI to obtain parameters, and then they validate the model by comparison with the actual Panda implementation running a basic benchmark and an actual application. The model predicts the benchmark performance very well, and is as much as 20\% off on the performance of the application. They have embedded the performance model in a "simulator", which predicts the performance of a given sequence of collective I/O requests, and they plan to use it in future versions of Panda to formulate I/O plans by predicting the performance resulting from several different Panda parameter settings, and choosing the best.} } @TechReport{chen:raid, author = {Peter Chen and Garth Gibson and Randy Katz and David Patterson and Martin Schulze}, title = {Two papers on {RAIDs}}, year = {1988}, month = {December}, number = {UCB/CSD 88/479}, institution = {UC Berkeley}, later = {schulze:raid2}, URL = {http://cs-tr.cs.berkeley.edu/TR/UCB:CSD-88-479}, keywords = {parallel I/O, RAID, disk array, pario-bib}, comment = {Basically an updated version of patterson:raid and the prepublished version of gibson:failcorrect.} } @Article{chen:raid-perf, author = {S. Chen and D. Towsley}, title = {A Performance Evaluation of {RAID} Architectures}, journal = {IEEE Transactions on Computers}, year = {1996}, month = {October}, volume = {45}, number = {10}, pages = {1116--1130}, publisher = {IEEE Computer Society Press}, URL = {http://computer.org/tc/tc1996/t1116abs.htm}, keywords = {parallel I/O, RAID, disk array, pario-bib} } @Article{chen:raid-survey, author = {Peter M. Chen and Edward K. Lee and Garth A. Gibson and Randy H. Katz and David A. Patterson}, title = {{RAID:} High-performance, Reliable Secondary Storage}, journal = {ACM Computing Surveys}, year = {1994}, month = {June}, volume = {26}, number = {2}, pages = {145--185}, keywords = {RAID, disk array, parallel I/O, survey, pario-bib}, comment = {An excellent overview of RAID concepts and technology. It starts from the beginning with a discussion of disk hardware, RAID basics, etc, and then goes on to discuss some of the more advanced features. They also describe a few RAID implementations. Basically, it is a perfect paper to read for folks new to RAID.} } @InProceedings{chen:raid2, author = {Peter M. Chen and Edward K. Lee and Ann L. Drapeau and Ken Lutz and Ethan L. Miller and Srinivasan Seshan and Ken Shirriff and David A. Patterson and Randy H. 
Katz}, title = {Performance and Design Evaluation of the {RAID-II} Storage Server}, booktitle = {Proceedings of the IPPS~'93 Workshop on Input/Output in Parallel Computer Systems}, year = {1993}, pages = {110--120}, address = {Newport Beach, CA}, later = {drapeau:raid-ii}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {A special back-end box for a Sun4 file server, that hooks a HIPPI network through a crossbar to fast memory, a parity engine, and a bunch of disks on SCSI. They pulled about 20~MB/s through it, basically disk-limited; with more disks they would hit 32--40~MB/s. Much improved over RAID-I, which was limited by the memory bandwidth of the Sun4 server.} } @InProceedings{chen:raid5stripe, author = {Peter Chen and Edward K. Lee}, title = {Striping in a {RAID} Level~5 Disk Array}, booktitle = {Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1995}, month = {May}, pages = {136--145}, keywords = {disk array, striping, RAID, pario-bib} } @Article{chen:tuning, author = {Ying Chen and Marianne Winslett}, title = {Automated Tuning of Parallel {I/O} Systems: An Approach to Portable {I/O} Performance for Scientific Applications}, journal = {IEEE Transactions on Software Engineering}, year = {2000}, month = {April}, volume = {26}, number = {4}, pages = {362--383}, keywords = {parallel I/O, pario-bib}, abstract = {Parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability, and usability goals of high performance scientific applications has become a significant challenge. For scientists, the problem is exacerbated by the need to retune the I/O portion of their code for each supercomputer platform where they obtain access. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. The authors present such an approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects high quality I/O plans without human intervention, based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses rule based and randomized search based algorithms to tune parameter settings in Panda, a parallel I/O library for multidimensional arrays. Our performance results obtained from an IBM SP using an out-of-core matrix multiplication application show that the Panda optimizer is able to select high quality I/O plans and deliver high performance under a variety of system configurations with a small total optimization overhead.} } @InProceedings{chervenak:raid, author = {Ann L. Chervenak and Randy H. Katz}, title = {Performance of a Disk Array Prototype}, booktitle = {Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1991}, pages = {188--197}, keywords = {parallel I/O, disk array, performance evaluation, RAID, pario-bib}, comment = {Measuring the performance of a RAID prototype with a Sun4/280, 28 disks on 7 SCSI strings, using 4 HBA controllers on a VME bus from the Sun. They found lots of bottlenecks that really slowed them down. Under Sprite, the disks were the bottleneck for single disk I/O, single disk B/W, and string I/O. 
Sprite was a bottleneck for single disk I/O and string I/O. The host memory was a bottleneck for string B/W, HBA B/W, overall I/O, and overall B/W. With a simpler OS that saved on data copying, they did better, but were still limited by the HBA, SCSI protocol, or the VME bus. Clearly they needed more parallelism in the busses and control system.} } @InCollection{chervenak:raid-ii, author = {Ann L. Chervenak and Ken Shirriff and John H. Hartman and Ethan L. Miller and Srinivasan Seshan and Randy H. Katz and Ken Lutz and David A. Patterson and Edward K. Lee and Peter M. Chen and Garth A. Gibson}, title = {{RAID-II}: A High-Bandwidth Network File Server}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {26}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {408--419}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {drapeau:raid-ii}, URL = {http://www.buyya.com/superstorage/}, keywords = {RAID, disk array, network file system, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of drapeau:raid-ii.} } @Article{chervenak:tertiary, author = {Ann L. Chervenak}, title = {Challenges for tertiary storage in multimedia servers}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {157--176}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00120-8}, keywords = {parallel I/O, multimedia, tertiary storage, memory hierarchy, pario-bib}, comment = {Part of a special issue.} } @InProceedings{chiang:graph, author = {Yi-Jen Chiang and Michael T. Goodrich and Edward F. Grove and Roberto Tamassia and Darren Erik Vengroff and Jeffrey Scott Vitter}, title = {External-Memory Graph Algorithms (Extended Abstract)}, booktitle = {Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA~'95)}, year = {1995}, month = {January}, pages = {139--149}, URL = {ftp://cs.duke.edu/pub/jsv/Papers/CGG95.external_graph.ps.Z}, keywords = {parallel I/O algorithm, graph algorithm, pario-bib}, abstract = {We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems. Our results include: \begin{itemize} \item {\em Proximate-neighboring}. We present a simple method for deriving external-memory lower bounds via reductions from a problem we call the ``proximate neighbors'' problem. We use this technique to derive non-trivial lower bounds for such problems as list ranking, expression tree evaluation, and connected components. \item {\em PRAM simulation}. We give methods for efficiently simulating PRAM computations in external memory, even for some cases in which the PRAM algorithm is not work-optimal. We apply this to derive a number of optimal (and simple) external-memory graph algorithms. \item {\em Time-forward processing}. We present a general technique for evaluating circuits (or ``circuit-like'' computations) in external memory. We also use this in a deterministic list ranking algorithm. \item {\em Deterministic 3-coloring of a cycle}. We give several optimal methods for 3-coloring a cycle, which can be used as a subroutine for finding large independent sets for list ranking. Our ideas go beyond a straightforward PRAM simulation, and may be of independent interest. \item {\em External depth-first search}. 
We discuss a method for performing depth first search and solving related problems efficiently in external memory. Our technique can be used in conjunction with ideas due to Ullman and Yannakakis in order to solve graph problems involving closed semi-ring computations even when their assumption that vertices fit in main memory does not hold. \end{itemize} \par Our techniques apply to a number of problems, including list ranking, which we discuss in detail, finding Euler tours, expression-tree evaluation, centroid decomposition of a tree, least-common ancestors, minimum spanning tree verification, connected and biconnected components, minimum spanning forest, ear decomposition, topological sorting, reachability, graph drawing, and visibility representation.} } @InProceedings{ching:efficient, author = {Avery Ching and Alok Choudhary and Wei-keng Liao and Robert Ross and William Gropp}, title = {Efficient structured data access in parallel file systems}, booktitle = {Proceedings of the IEEE International Conference on Cluster Computing}, year = {2003}, month = {December}, pages = {326--335}, institution = {Northwestern Univ, Dept Elect \& Comp Engn, Evanston, IL 60208 USA}, publisher = {IEEE Computer Society Press}, address = {Hong Kong, China}, URL = {http://www.ece.northwestern.edu/~wkliao/SciDAC/Publications/cluster2003.pdf}, keywords = {I/O interface, high-level libraries, PVFS, structured data representations, pario-bib}, abstract = {Parallel scientific applications store and retrieve very large, structured datasets. Directly supporting these structured accesses is an important step in providing high-performance I/O solutions for these applications. High-level interfaces such as HDF5 and Parallel netCDF provide convenient APIs for accessing structured datasets, and the MPI-IO interface also supports efficient access to structured data. However, parallel file systems do not traditionally support such access. In this work, we present an implementation of structured data access support in the context of the Parallel Virtual File System (PVFS). We call this support "datatype I/O" because of its similarity to MPI datatypes. This support is built by using a reusable datatype-processing component from the MPICH2 MPI implementation. We describe how this component is leveraged to efficiently process structured data representations resulting from MPI-IO operations. We quantitatively assess the solution using three test applications. We also point to further optimizations in the processing path that could be leveraged for even more efficient operation.}, comment = {not read, don't have} } @InProceedings{ching:noncontiguous, author = {Avery Ching and Alok Choudhary and Kenin Coloma and Wei-keng Liao and Robert Ross and William Gropp}, title = {Noncontiguous {I/O} Accesses Through {MPI-IO}}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {104--111}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190104abs.htm}, keywords = {parallel I/O, MPI-IO, ROMIO, list I/O, noncontiguous access, pario-bib}, abstract = {I/O performance remains a weakness of parallel computing systems today. 
While this weakness is partly attributed to rapid advances in other system components, I/O interfaces available to programmers and the I/O methods supported by file systems have traditionally not matched efficiently with the types of I/O operations that scientific applications perform, particularly noncontiguous accesses. The MPI-IO interface allows for rich descriptions of the I/O patterns desired for scientific applications and implementations such as ROMIO have taken advantage of this ability while remaining limited by underlying file system methods. \par A method of noncontiguous data access, list I/O, was recently implemented in the Parallel Virtual File System (PVFS). We implement support for this interface in the ROMIO MPI-IO implementation. Through a suite of non-contiguous I/O tests we compared ROMIO list I/O to current methods of ROMIO noncontiguous access and found that the list I/O interface provides performance benefits in many noncontiguous cases.} } @Article{chiu:smart-disks, author = {Steve C. Chiu and Wei-keng Liao and Alok N. Choudhary and Mahmut T. Kandemir}, title = {Processor-embedded distributed smart disks for {I/O}-intensive workloads: architectures, performance models and evaluation}, journal = {Journal of Parallel and Distributed Computing}, year = {2004}, month = {March}, volume = {64}, number = {3}, pages = {427--445}, institution = {Northwestern Univ, Dept Elect \& Comp Engn, Evanston, IL 60208 USA; Northwestern Univ, Dept Elect \& Comp Engn, Evanston, IL 60208 USA; Penn State Univ, Dept Comp Sci \& Engn, University Pk, PA 16802 USA}, publisher = {USA : Academic Press, 2004}, copyright = {(c)2004 IEE; Institute for Scientific Information, Inc.}, URL = {http://portal.acm.org/citation.cfm?id=1005485}, keywords = {processor-embedded disks, smart disks, analytic performance models, I/O workload, pario-bib}, abstract = {Processor-embedded disks, or smart disks, with their network interface controller, can in effect be viewed as processing elements with on-disk memory and secondary storage. The data sizes and access patterns of today's large I/O-intensive workloads require architectures whose processing power scales with increased storage capacity. To address this concern, we propose and evaluate disk-based distributed smart storage architectures. Based on analytically derived performance models, our evaluation with representative workloads shows that offloading processing and performing point-to-point data communication improve performance over centralized architectures. Our results also demonstrate that distributed smart disk systems exhibit desirable scalability and can efficiently handle I/O-intensive workloads, such as commercial decision support database (TPC-H) queries, association rules mining, data clustering, and two-dimensional fast Fourier transform, among others. (15 refs.)} } @InProceedings{chiueh:tapes, author = {{Tzi-cker} Chiueh}, title = {Performance Optimization for Parallel Tape Arrays}, booktitle = {Proceedings of the 9th ACM International Conference on Supercomputing}, year = {1995}, month = {July}, pages = {375--384}, publisher = {ACM Press}, address = {Barcelona}, keywords = {parallel I/O, tape striping, pario-bib}, comment = {URL points to tech report version. 
He points out two problems with tape striping: that it is difficult to keep tape drives synchronized due to physical variations and to bad-segment remapping in the tape, and that the start-up cost is very high making it difficult to get multiple tapes loaded and started at the same time. So he proposes a 'triangular interleaving' rather than the traditional round-robin interleaving, coupled with lots of buffering, to deal with these problems. He also proposes to use different striping factors for different files (movies), depending on access characteristics. He includes parameters for some tape robots.} } @InProceedings{chiung-san:xdac, author = {Lee Chiung-San and Parng Tai-Ming and Lee Jew-Chin and Tsai Cheng-Nan and Farn Kwo-Jean}, title = {Performance analysis of the {XDAC} disk array system}, booktitle = {Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing}, year = {1994}, pages = {620--627}, publisher = {IEEE Computer Society Press}, keywords = {disk array, performance evaluation, analytical model, parallel I/O, pario-bib}, abstract = {The paper presents an analytical model of a whole disk array architecture, XDAC, which consists of several major subsystems and features: the two-dimensional array structure; IO-bus with split transaction protocol; and cache for processing multiple I/O requests in parallel. Our modelling approach is based on a subsystem access time per request (SATPR) concept, in which we model for each subsystem the mean access time per disk array request. The model is fed with a given set of representative workload parameters and then used to conduct performance analysis for exploring the impact of fork/join synchronization as well as evaluating some architectural design issues of the XDAC system. Moreover, by comparing the SATPRs of subsystems, we can identify the bottleneck for performance improvements.} } @InProceedings{cho:fine-grained, author = {Yong Cho and Marianne Winslett and Ying Chen and Szu-wen Kuo}, title = {Parallel {I/O} Performance of Fine Grained Data Distributions}, booktitle = {Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing}, year = {1998}, month = {July}, pages = {163--170}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/proceedings/hpdc/8579/85790163abs.htm}, keywords = {parallel I/O, pario-bib}, abstract = {Fine grained data distributions are widely used to balance computational loads across compute processes in parallel scientific applications. When a fine grained data distribution is used in memory, performance of I/O intensive applications can be limited not only by disk speed but also by message passing, because a large number of small messages may be generated by the implementation strategy used in the underlying parallel file system or parallel I/O library. Combining (or packetizing) a set of small messages into a large message is generally known to speed up parallel I/O. However, overall I/O performance is affected not only by small messages but also by other factors like cyclic block size and interconnect characteristics. 
We describe small message combination and communication scheduling for fine grained data distributions in the Panda parallel I/O library and analyze I/O performance on parallel platforms having different interconnects: IBM SP2, IBM workstation cluster connected by FDDI and Pentium II cluster connected by Myrinet.} } @InProceedings{cho:local, author = {Yong Cho and Marianne Winslett and Mahesh Subramaniam and Ying Chen and Szu-wen Kuo and Kent E. Seamons}, title = {Exploiting Local Data in Parallel Array {I/O} on a Practical Network of Workstations}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {1--13}, publisher = {ACM Press}, address = {San Jose, CA}, URL = {http://doi.acm.org/10.1145/266220.266221}, keywords = {parallel I/O, distributed system, pario-bib}, abstract = {A cost-effective way to run a parallel application is to use existing workstations connected by a local area network such as Ethernet or FDDI. In this paper, we present an approach for parallel I/O of multidimensional arrays on small networks of workstations with a shared-media interconnect, using the Panda I/O library. \par In such an environment, the message passing throughput per node is lower than the throughput obtainable from a fast disk and it is not easy for users to determine the configuration which will yield the best I/O performance. \par We introduce an I/O strategy that exploits local data to reduce the amount of data that must be shipped across the network, present experimental results, and analyze the results using an analytical performance model and predict the best choice of I/O parameters. \par Our experiments show that the new strategy results in a factor of 1.2--2.1 speedup in response time compared to the Panda version originally developed for the IBM SP2, depending on the array sizes, distributions and compute and I/O node meshes. Further, the performance model predicts the results within a 13\% margin of error.}, comment = {They examine a system that supports nodes that are both compute and I/O nodes. The assumption is that the application is writing data to a new file, and does not care to which disks the data goes. They are trying to decide which nodes should be used for I/O, given the distribution of data on compute nodes and the distribution desired across disks. They use a Hungarian algorithm to solve a weighted optimization problem on a bipartite graph connecting I/O nodes to compute nodes, in an attempt to minimize the data flow across the network. But there is no attempt to make a decision that might be sensible for a future read operation that may want to read in a different pattern.} } @Article{choudhary:jmanagement, author = {A. Choudhary and M. Kandemir and J. No and G. Memik and X. Shen and W. Liao and H. Nagesh and S. More and V. Taylor and R. Thakur and R. Stevens}, title = {Data Management for Large-Scale Scientific Computations in High Performance Distributed Systems}, journal = {Cluster Computing}, year = {2000}, volume = {3}, number = {1}, pages = {45--60}, publisher = {Baltzer Science Publishers}, earlier = {choudhary:management}, URL = {http://www.baltzer.nl/cluster/contents/2000/3-1.html#clus067}, keywords = {cluster computing, scientific computing, parallel I/O, data management, pario-bib}, abstract = {With the increasing number of scientific applications manipulating huge amounts of data, effective high-level data management is an increasingly important problem. 
Unfortunately, so far the solutions to the high-level data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file storage systems) or produce unsatisfactory I/O performance in exchange for ease-of-use and portability (as in relational DBMSs). In this paper we present a novel application development environment which is built around an active meta-data management system (MDMS) to handle high-level data in an effective manner. The key components of our three-tiered architecture are user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease-of-use and portability. The high levels of performance are achieved by the MDMS, with the aid of user-specified, performance-oriented directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques for the application at hand to the MDMS. We discuss the importance of an active MDMS and show how the three components of our environment, namely the application, the MDMS, and the HSS, fit together. We also report performance numbers from our ongoing implementation and illustrate that significant improvements are made possible without undue programming effort.} } @InProceedings{choudhary:management, author = {A. Choudhary and M. Kandemir and H. Nagesh and J. No and X. Shen and V. Taylor and S. More and R. Thakur}, title = {Data Management for Large-Scale Scientific Computations in High Performance Distributed Systems}, booktitle = {Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing}, year = {1999}, month = {August}, pages = {263--272}, publisher = {IEEE Computer Society Press}, address = {Redondo Beach, CA}, later = {choudhary:jmanagement}, URL = {http://computer.org/conferen/proceed/hpdc/0287/02870042abs.htm}, keywords = {cluster computing, scientific computing, parallel I/O, data management, pario-bib}, abstract = {With the increasing number of scientific applications manipulating huge amounts of data, effective data management is an increasingly important problem. Unfortunately, so far the solutions to this data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file systems) or produce unsatisfactory I/O performance in exchange for ease-of-use and portability (as in relational DBMSs).\par In this paper we present a new environment which is built around an active meta-data management system (MDMS). The key components of our three-tiered architecture are user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease-of-use and portability.\par The high levels of performance are achieved by the MDMS, with the aid of user-specified directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques to the MDMS. We discuss the importance of an active MDMS and show how the three components, namely application, the MDMS, and the HSS, fit together. 
We also report performance numbers from our initial implementation and illustrate that significant improvements are made possible without undue programming effort.}, comment = {They argue that existing parallel file systems are too low-level, they have their own set of I/O calls (non-portable), and policies are generally hard-coded into the system. Databases provide a portable layer on top of the file system, but they cannot provide high performance. They propose to "combine the advantages of file systems and databases, while avoiding their respective disadvantages." Their system is composed of a user program, a meta-data management system (MDMS), and a hierarchical storage system (HSS). The user program will query the MDMS to learn where in the HSS their data reside, what the performance of the storage system is, information about the acc data from the storage system, etc...} } @TechReport{choudhary:passion, author = {Alok Choudhary and Rajesh Bordawekar and Michael Harry and Rakesh Krishnaiyer and Ravi Ponnusamy and Tarvinder Singh and Rajeev Thakur}, title = {{PASSION:} Parallel And Scalable Software for Input-Output}, year = {1994}, month = {September}, number = {SCCS-636}, institution = {ECE Dept., NPAC and CASE Center, Syracuse University}, later = {thakur:jpassion}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/passion_report.ps.Z}, keywords = {parallel I/O, out-of-core, pario-bib}, comment = {This TR overviews the PASSION project, and all its components: two-phase access, out-of-core support for structured and unstructured problems, data sieving, prefetching, caching, compiler and language support, file system support, virtual parallel file system, and parallel pipes. They reference many of their related papers in an extensive bibliography. See also singh:adopt, jadav:ioschedule, thakur:passion, thakur:runtime, bordawekar:efficient, thakur:out-of-core, delrosario:prospects, delrosario:two-phase, bordawekar:primitives, bordawekar:delta-fs.} } @InProceedings{choudhary:passion-paragon, author = {Alok Choudhary and Rajesh Bordawekar and Sachin More and K. Sivaram and Rajeev Thakur}, title = {{PASSION} Runtime Library for the {Intel Paragon}}, booktitle = {Proceedings of the Intel Supercomputer User's Group Conference}, year = {1995}, month = {June}, later = {thakur:jpassion}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/isug95-passion.ps.Z}, keywords = {parallel I/O, runtime library, pario-bib}, abstract = {We are developing a runtime library which provides a number of routines to perform the I/O required in parallel applications in an efficient and convenient manner. This is part of a project called PASSION, which aims to provide software support for high-performance parallel I/O at the compiler, runtime and file system levels. The PASSION Runtime Library uses a high-level interface which makes it easy for the user to specify the I/O required in the program. The user only needs to specify what portion of the data structure needs to be read from or written to the file, and the PASSION routines will perform all the necessary I/O efficiently. 
This paper gives an overview of the PASSION Runtime Library and describes in detail its high-level interface.}, comment = {See also choudhary:passion.} } @Article{choudhary:sdcr, author = {Alok Choudhary and David Kotz}, title = {Large-Scale File Systems with the Flexibility of Databases}, journal = {ACM Computing Surveys}, year = {1996}, month = {December}, volume = {28A}, number = {4}, publisher = {ACM Press}, copyright = {ACM}, note = {Position paper for the Working Group on Storage I/O for Large-Scale Computing, ACM Workshop on Strategic Directions in Computing Research. Available on-line only, at http://doi.acm.org/10.1145/242224.242488.}, URL = {http://doi.acm.org/10.1145/242224.242488}, keywords = {file system, database, parallel I/O, pario-bib, dfk}, comment = {A position paper for the Strategic Directions in Computer Research workshop at MIT in June 1996. See gibson:sdcr and wegner:sdcr.} } @TechReport{choudhary:sio-language, author = {Alok Choudhary and Ian Foster and Geoffrey Fox and Ken Kennedy and Carl Kesselman and Charles Koelbel and Joel Saltz and Marc Snir}, title = {Languages, Compilers, and Runtime Systems Support for Parallel Input-Output}, year = {1994}, number = {CCSF-39}, institution = {Scalable I/O Initiative}, address = {Caltech Concurrent Supercomputing Facilities, Caltech}, URL = {http://www.cacr.caltech.edu/SIO/pubs/SIO_comp.ps}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {Language extensions to support parallel I/O. Compiler optimizations. Runtime library to support the compiler and interface with the native file system. Compiler would develop a mapping of data to the processor memories and to the disks, and then decide on I/O schedules to move data around, overlap I/O with computation, even move computation around to best fit what is available in memory at a given time. It can also help with checkpointing. Compiler should pass info to the runtime system, which in turn may need to pass info to the file system, to help with optimization. I/O scheduling includes reordering accesses; they even go so far as to propose doing seek optimization in the runtime library. Support for collective I/O. Extension of MPI to I/O, to take advantage of its support for asynchrony, scatter-gather, etc.\ On the way, they hope to work with the FS people to decide on the functional requirements of the file system. See also poole:sio-survey, bagrodia:sio-character, bershad:sio-os.} } @InProceedings{chung-sheng:arrays, author = {Li Chung-Sheng and Chen Ming-Syan and P.~S. Yu and Hsiao Hui-I}, title = {Combining replication and parity approaches for fault-tolerant disk arrays}, booktitle = {Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing}, year = {1994}, pages = {360--367}, publisher = {IEEE Computer Society Press}, keywords = {fault tolerance, disk array, replication, declustering, parallel I/O, pario-bib}, abstract = {We explore the method of combining the replication and parity approaches to tolerate multiple disk failures in a disk array. In addition to the conventional mirrored and chained declustering methods, a method based on the hybrid of mirrored-and-chained declustering is explored. A performance study that explores the effect of combining replication and parity approaches is conducted. 
It is experimentally shown that the proposed approach can lead to the most cost-effective solution if the objective is to sustain the same load as before the failures.}, comment = {Consider hybrid chained and mirrored declustering.} } @InProceedings{clark:molecular, author = {Terry W. Clark and L. Ridgway Scott and Stanislaw Wlodek and J. Andrew McCammon}, title = {{I/O} Limitations in Parallel Molecular Dynamics}, booktitle = {Proceedings of Supercomputing '95}, year = {1995}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://www.supercomp.org/sc95/proceedings/524_TCLA/SC95.HTM}, keywords = {parallel I/O application, molecular dynamics, pario-bib}, abstract = {We discuss data production rates and their impact on the performance of scientific applications using parallel computers. On one hand, too high rates of data production can be overwhelming, exceeding logistical capacities for transfer, storage and analysis. On the other hand, the rate limiting step in a computationally-based study should be the human-guided analysis, not the calculation. We present performance data for a biomolecular simulation of the enzyme, acetylcholinesterase, which uses the parallel molecular dynamics program EulerGROMOS. The actual production rates are compared against a typical time frame for results analysis where we show that the rate limiting step is the simulation, and that to overcome this will require improved output rates.}, comment = {Note proceedings only on CD-ROM or WWW.} } @Article{clement:overlap, author = {Mark J. Clement and Michael J. Quinn}, title = {Overlapping Computations, Communications and {I/O} in Parallel Sorting}, journal = {Journal of Parallel and Distributed Computing}, year = {1995}, month = {August}, volume = {28}, number = {2}, pages = {162--172}, publisher = {Academic Press}, keywords = {parallel I/O algorithm, sorting, pario-bib}, comment = {They present a new parallel sorting algorithm that allows overlap between disk, network, and processor. By pipelining the tasks, they can double the speed of sorting; best results, of course, when these three components take approximately equal time. The disk I/O is really only used to load the initial data set and write the output data set, rather than being used for an external sorting scheme. They obtain their gains by overlapping that disk I/O with the communication and processing.} } @InProceedings{coelho:hpf-io, author = {Fabien Coelho}, title = {Compilation of {I/O} Communications for {HPF}}, booktitle = {Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation}, year = {1995}, pages = {102--109}, keywords = {parallel I/O, HPF, compiler, pario-bib} } @Misc{colarelli:allocate, author = {Dennis Colarelli and Gene Schumacher}, title = {New Strategy for File Allocation on Multi-device Systems}, year = {1993}, howpublished = {Supercomputing '93 poster session}, keywords = {file system, parallel I/O, disk layout, pario-bib}, comment = {These two guys from NCAR redid the block allocation strategy routine on the Cray. Current strategy uses round-robin among the disks, using a different disk for each allocation request. Each request looks for blocks on that disk, until it is satisfied or space runs out, and then goes to the next disk. It uses a free-block bitmap to find the blocks. Problem: too many extents, not enough contiguity. These guys tried first-fit and best-fit from all extents on all disks. 
First-fit had faster allocation time, of course, and both had much lower file fragmentation. They also used the vector hardware to search the bitmap for non-zero words.} } @InProceedings{coleman:bottleneck, author = {Samuel S. Coleman and Richard W. Watson}, title = {New Architectures to Reduce {I/O} Bottlenecks in High-Performance Systems}, booktitle = {Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences}, year = {1993}, volume = {I}, pages = {5--14}, keywords = {parallel architecture, parallel I/O, pario-bib}, comment = {They argue for network-attached devices, and for making I/O devices and networks, instead of CPUs, more the center of architectural design.} } @InProceedings{coloma:caching, author = {Kenin Coloma and Alok Choudhary and Wei-keng Liao and Lee Ward and Eric Russell and Neil Pundit}, title = {Scalable high-level caching for parallel {I/O}}, booktitle = {Proceedings of the International Parallel and Distributed Processing Symposium}, year = {2004}, month = {April}, pages = {96b}, publisher = {Los Alamitos, CA, USA : IEEE Comput. Soc, 2004}, copyright = {(c)2005 IEE}, address = {Santa Fe, NM}, URL = {http://csdl.computer.org/comp/proceedings/ipdps/2004/2132/01/213210096babs.htm}, keywords = {client-side file caching, file locking, MPI, pario-bib}, abstract = {In order for I/O systems to achieve high performance in a parallel environment, they must either sacrifice client-side file caching, or keep caching and deal with complex coherency issues. The most common technique for dealing with cache coherency in multi-client file caching environments uses file locks to bypass the client-side cache. Aside from effectively disabling cache usage, file locking is sometimes unavailable on larger systems. \par The high-level abstraction layer of MPI allows us to tackle cache coherency with additional information and coordination without using file locks. By approaching the cache coherency issue further up, the underlying I/O accesses can be modified in such a way as to ensure access to coherent data while satisfying the user's I/O request. We can effectively exploit the benefits of a file system's client-side cache while minimizing its management costs.} } @InProceedings{colvin:vic, author = {Alex Colvin and Thomas H. Cormen}, title = {{ViC*}: A Compiler for Virtual-Memory {C*}}, booktitle = {Proceedings of the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS~'98)}, year = {1998}, month = {March}, pages = {23--33}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {colvin:vic-tr}, keywords = {compiler, data-parallel programming, programming language, virtual memory, out of core, parallel I/O, pario-bib}, abstract = {This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared outofcore, which describe parallel data stored on disk. The compiler output is a SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data access.} } @TechReport{colvin:vic-tr, author = {Alex Colvin and Thomas H. Cormen}, title = {{ViC*}: A Compiler for Virtual-Memory {C*}}, year = {1997}, month = {November}, number = {PCS-TR97-323}, institution = {Dept. 
of Computer Science, Dartmouth College}, copyright = {the authors}, later = {colvin:vic}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/156/}, keywords = {compiler, data-parallel programming, programming language, virtual memory, out of core, parallel I/O, pario-bib}, abstract = {This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared outofcore, which describe parallel data stored on disk. The compiler output is a SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data layout and access.} } @Book{convex:exemplar, key = {CCC}, title = {{Convex Exemplar} Scalable Parallel Processing System}, year = {1994}, publisher = {Convex Computer Corporation}, note = {Order number 080-002293-000}, keywords = {parallel computer architecture, shared memory, parallel I/O, pario-bib}, comment = {The Convex Exemplar connects {\em hypernodes}, which are basically SMP nodes built from 8 HP PA-RISC CPUs, lots of RAM, and a crossbar switch, with their own implementation of the SCI interconnect. Hierarchical caching supports a global shared physical address space. Each hypernode can also have an I/O adapter, to which they can attach lots of different I/O devices. The I/O adapter has the capability to DMA directly into any memory in the system, even on other hypernodes. Each hypernode runs its own file-system server, which manages UNIX file systems on the devices of that hypernode. Striped file systems are supported in software, although it's not clear if they can stripe across hypernodes, or only within hypernodes, ie, whether (striped) file systems can span multiple hypernodes.} } @Manual{convex:stripe, title = {{CONVEX UNIX} Programmer's Manual, Part I}, edition = {Eighth}, year = {1988}, month = {October}, organization = {CONVEX Computer Corporation}, address = {Richardson, Texas}, keywords = {parallel I/O, parallel file system, striping, pario-bib}, comment = {Implementation of striped disks on the CONVEX. Uses partitions of normal device drivers. Kernel data structure knows about the interleaving granularity, the set of partitions, sizes, etc.} } @Article{cook:simd-jpeg, author = {Gregory W. Cook and Edward J. Delp}, title = {An Investigation of Scalable {SIMD I/O} Techniques with Application to Parallel {JPEG} Compression}, journal = {Journal of Parallel and Distributed Computing}, year = {1995}, month = {November}, volume = {30}, number = {2}, pages = {111--128}, publisher = {Academic Press}, keywords = {multimedia, compression, data-parallel computing, Maspar, parallel I/O, pario-bib} } @InProceedings{copeland:bubba, author = {George Copeland and William Alexander and Ellen Boughter and Tom Keller}, title = {Data Placement in {Bubba}}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1988}, month = {June}, pages = {99--108}, publisher = {ACM Press}, address = {Chicago, IL}, keywords = {parallel I/O, database, disk caching, pario-bib}, comment = {A database machine. Experimental/analytical model of a placement algorithm that declusters relations across several parallel, independent disks. 
The declustering is done on a subset of the disks, and the choices involved are the number of disks to decluster onto, which relations to put where, and whether a relation should be cache-resident. Communications overhead limits the usefulness of declustering in some cases, depending on the workload. See boral:bubba.} } @InCollection{corbett:bmpi-overview, author = {Peter Corbett and Dror Feitelson and Sam Fineberg and Yarsun Hsu and Bill Nitzberg and Jean-Pierre Prost and Marc Snir and Bernard Traversat and Parkson Wong}, title = {Overview of the {MPI-IO} Parallel {I/O} Interface}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {32}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {477--487}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {corbett:mpi-overview}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {Part of jin:io-book, revised version of corbett:mpi-overview.} } @InCollection{corbett:bvesta, author = {Peter F. Corbett and Dror G. Feitelson}, title = {The {Vesta} Parallel File System}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {20}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {285--308}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {corbett:jvesta}, URL = {http://www.buyya.com/superstorage/}, keywords = {multiprocessor file system, Vesta, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of corbett:jvesta.} } @Article{corbett:jvesta, author = {Peter F. Corbett and Dror G. Feitelson}, title = {The {Vesta} Parallel File System}, journal = {ACM Transactions on Computer Systems}, year = {1996}, month = {August}, volume = {14}, number = {3}, pages = {225--264}, publisher = {ACM Press}, earlier = {corbett:vesta}, later = {corbett:bvesta}, URL = {http://www.acm.org/pubs/citations/journals/tocs/1996-14-3/p225-corbett/#abstract}, keywords = {multiprocessor file system, Vesta, parallel I/O, pario-bib}, abstract = {The Vesta parallel file system is designed to provide parallel file access to application programs running on multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of files: a file is not a sequence of bytes, but rather it can be partitioned into multiple disjoint sequences that are accessed in parallel. The partitioning--which can also be changed dynamically--reduces the need for synchronization and coordination during the access. Some control over the layout of data is also provided, so the layout can be matched with the anticipated access patterns. The system is fully implemented and forms the basis for the AIX Parallel I/O File System on the IBM SP2. The implementation does not compromise scalability or parallelism. In fact, all data accesses are done directly to the I/O node that contains the requested data, without any indirection or access to shared metadata. Disk mapping and caching functions are confined to each I/O node, so there is no need to keep data coherent across nodes. Performance measurements show good scalability with increased resources. Moreover, different access patterns are shown to achieve similar performance.}, comment = {See also corbett:pfs, corbett:vesta*, feitelson:pario. This is the ultimate Vesta reference. 
There seem to be only a few small things that are completely new over what's been published elsewhere, although this presentation is much more complete and polished.} } @TechReport{corbett:mpi-io2, author = {Peter Corbett and Dror Feitelson and Yarsun Hsu and Jean-Pierre Prost and Marc Snir and Sam Fineberg and Bill Nitzberg and Bernard Traversat and Parkson Wong}, title = {{MPI-IO:} A Parallel File {I/O} Interface for {MPI}}, year = {1994}, month = {November}, number = {RC~19841 (87784)}, institution = {IBM T.J. Watson Research Center}, note = {Version 0.2}, later = {corbett:mpi-io3}, keywords = {parallel I/O, message-passing, multiprocessor file system interface, pario-bib}, comment = {Superceded by mpi-ioc:mpi-io5. See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.} } @TechReport{corbett:mpi-io3, author = {Peter Corbett and Dror Feitelson and Yarsun Hsu and Jean-Pierre Prost and Marc Snir and Sam Fineberg and Bill Nitzberg and Bernard Traversat and Parkson Wong}, title = {{MPI-IO:} A Parallel File {I/O} Interface for {MPI}}, year = {1995}, month = {January}, number = {NAS-95-002}, institution = {NASA Ames Research Center}, note = {Version 0.3}, earlier = {corbett:mpi-io2}, later = {corbett:mpi-io4}, URL = {http://science.nas.nasa.gov/Pubs/TechReports/NASreports/NAS-95-002/index.html}, keywords = {parallel I/O, message-passing, multiprocessor file system interface, pario-bib}, comment = {The goal is to design a standard file interface for SPMD message-passing programs. An earlier version of this specification was prost:mpi-io. Superceded by mpi-ioc:mpi-io5. See also the general MPI I/O web page at http://parallel.nas.nasa.gov/MPI-IO/.} } @Misc{corbett:mpi-io4, author = {Peter Corbett and Yarsun Hsu and Jean-Pierre Prost and Marc Snir and Sam Fineberg and Bill Nitzberg and Bernard Traversat and Parkson Wong and Dror Feitelson}, title = {{MPI-IO:} A Parallel File {I/O} Interface for {MPI}}, year = {1995}, month = {December}, institution = {IBM T. J. Watson Research Center and NASA Ames Research Center and The Hebrew University}, note = {Version 0.4}, earlier = {corbett:mpi-io3}, later = {mpi-ioc:mpi-io5}, keywords = {parallel I/O, message-passing, multiprocessor file system interface, pario-bib}, comment = {Superceded by mpi-ioc:mpi-io5. See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.} } @InProceedings{corbett:mpi-overview, author = {Peter Corbett and Dror Feitelson and Sam Fineberg and Yarsun Hsu and Bill Nitzberg and Jean-Pierre Prost and Marc Snir and Bernard Traversat and Parkson Wong}, title = {Overview of the {MPI-IO} Parallel {I/O} Interface}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {1--15}, later = {corbett:mpi-overview-book}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {Thanks to MPI, writing portable message passing parallel programs is almost a reality. One of the remaining problems is file I/O. Although parallel file systems support similar interfaces, the lack of a standard makes developing a truly portable program impossible. It is not feasible to develop large scientific applications from scratch for each generation of parallel machine, and, in the scientific world, a program is not considered truly portable unless it not only compiles, but also runs efficiently. \par The MPI-IO interface is being proposed as an extension to the MPI standard to fill this need. 
MPI-IO supports a high-level interface to describe the partitioning of file data among processes, a collective interface describing complete transfers of global data structures between process memories and files, asynchronous I/O operations, allowing computation to be overlapped with I/O, and optimization of physical file layout on storage devices (disks).}, comment = {A more readable explanation of MPI-IO than the proposed-standard document corbett:mpi-io3. See polished book version, corbett:mpi-overview-book. See also the slides presented at IOPADS} } @InCollection{corbett:mpi-overview-book, author = {Peter Corbett and Dror Feitelson and Sam Fineberg and Yarsun Hsu and Bill Nitzberg and Jean-Pierre Prost and Marc Snir and Bernard Traversat and Parkson Wong}, title = {Overview of the {MPI-IO} Parallel {I/O} Interface}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {5}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {127--146}, publisher = {Kluwer Academic Publishers}, earlier = {corbett:mpi-overview}, keywords = {parallel I/O, file system interface, pario-bib}, abstract = {Thanks to MPI, writing portable message passing parallel programs is almost a reality. One of the remaining problems is file I/O. Although parallel file systems support similar interfaces, the lack of a standard makes developing a truly portable program impossible. It is not feasible to develop large scientific applications from scratch for each generation of parallel machine, and, in the scientific world, a program is not considered truly portable unless it not only compiles, but also runs efficiently. \par The MPI-IO interface is being proposed as an extension to the MPI standard to fill this need. MPI-IO supports a high-level interface to describe the partitioning of file data among processes, a collective interface describing complete transfers of global data structures between process memories and files, asynchronous I/O operations, allowing computation to be overlapped with I/O, and optimization of physical file layout on storage devices (disks).}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @Article{corbett:pfs, author = {Peter F. Corbett and Dror G. Feitelson and Jean-Pierre Prost and George S. Almasi and Sandra Johnson Baylor and Anthony S. Bolmarcich and Yarsun Hsu and Julian Satran and Marc Snir and Robert Colao and Brian Herr and Joseph Kavaky and Thomas R. Morgan and Anthony Zlotek}, title = {Parallel File Systems for the {IBM SP} Computers}, journal = {IBM Systems Journal}, year = {1995}, month = {January}, volume = {34}, number = {2}, pages = {222--248}, keywords = {parallel file system, parallel I/O, Vesta, pario-bib}, abstract = {Parallel computer architectures require innovative software solutions to utilize their capabilities. This statement is true for system software no less than for application programs. File system development for the IBM SP product line of computers started with the Vesta research project, which introduced the ideas of parallel access to partitioned files. This technology was then integrated with a conventional Advanced Interactive Executive (AIX) environment to create the IBM AIX Parallel I/O File System product. We describe the design and implementation of Vesta, including user interfaces and enhancements to the control environment needed to run the system. 
Changes to the basic design that were made as part of the AIX Parallel I/O File System are identified and justified.}, comment = {Probably the most authoritative Vesta/PIOFS paper yet. Good description of the system, motivations, etc. Not as much detail as some, like corbett:vesta-di.} } @InProceedings{corbett:rdp, author = {Peter Corbett and Bob English and Atul Goel and Tomislav Grcanac and Steven Kleiman and James Leong and Sunitha Sankar}, title = {Row-Diagonal Parity for Double Disk Failure Correction}, booktitle = {Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies}, year = {2004}, month = {March}, pages = {1--14}, organization = {Network Appliance, Inc.}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast04/tech/corbett.html}, keywords = {fault tolerance, disk failures, algorithms, row-diagonal parity, RAID, pario-bib}, abstract = {Row-Diagonal Parity (RDP) is a new algorithm for protecting against double disk failures. It stores all data unencoded, and uses only exclusive-or operations to compute parity. RDP is provably optimal in computational complexity, both during construction and reconstruction. Like other algorithms, it is optimal in the amount of redundant information stored and accessed. RDP works within a single stripe of blocks of sizes normally used by file systems, databases and disk arrays. It can be utilized in a fixed (RAID-4) or rotated (RAID-5) parity placement style. It is possible to extend the algorithm to encompass multiple RAID-4 or RAID-5 disk arrays in a single RDP disk array. It is possible to add disks to an existing RDP array without recalculating parity or moving data. Implementation results show that RDP performance can be made nearly equal to single parity RAID-4 and RAID-5 performance.}, comment = {Awarded best paper.} } @Misc{corbett:sio-api1.0, author = {Peter F. Corbett and Jean-Pierre Prost and Chris Demetriou and Garth Gibson and Erik Reidel and Jim Zelenka and Yuqun Chen and Ed Felten and Kai Li and John Hartman and Larry Peterson and Brian Bershad and Alec Wolman and Ruth Aydt}, title = {Proposal for a Common Parallel File System Programming Interface}, year = {1996}, month = {September}, howpublished = {WWW http://www.cs.arizona.edu/sio/api1.0.ps}, note = {Version 1.0.}, URL = {http://www.cs.arizona.edu/sio}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {Specs of the proposed SIO low-level interface for parallel file systems. Key features: linear file model, scatter-gather read and write calls (list of strided segments), asynch versions of all calls, extensive hint system. Naming structure is unspecified; no directories specified. Permissions left out. Some control over client caching and over disk layout. Each file has a (small) 'label', which is just a little space for application-controlled meta data. Optional extensions: collective read and write calls, fast copy.} } @InProceedings{corbett:user-friendly, author = {Peter F. Corbett and Dror G. Feitelson and Jean-Pierre Prost and Marc Snir}, title = {User-friendly and efficient parallel {I/O}s using the {Vesta} parallel file system}, booktitle = {Transputers '94: Advanced Research and Industrial Applications}, year = {1994}, month = {September}, pages = {23--38}, publisher = {IOS Press}, keywords = {multiprocessor file system interface, parallel I/O, Vesta, pario-bib} } @InProceedings{corbett:vesta, author = {Peter F. Corbett and Sandra Johnson Baylor and Dror G. 
Feitelson}, title = {Overview of the {Vesta} Parallel File System}, booktitle = {Proceedings of the IPPS~'93 Workshop on Input/Output in Parallel Computer Systems}, year = {1993}, pages = {1--16}, address = {Newport Beach, CA}, note = {Also published in Computer Architecture News 21(5), December 1993, pages 7--14}, later = {corbett:jvesta}, keywords = {parallel I/O, multiprocessor file system, concurrent file checkpointing, multiprocessor file system interface, Vesta, pario-bib}, comment = {See corbett:jvesta. Design of a file system for a message-passing MIMD multiprocessor to be used for scientific computing. Separate I/O nodes from compute nodes; I/O nodes and disks are viewed as a data-staging area. File system runs on I/O nodes only. Files declustered by record, among physical partitions, each residing on a separate disk, and each separately growable. Then the user maps logical partitions, one per process, on the file at open time. These are designed to be two-dimensional, so that mapping arrays of various strides and contiguities, with records as the basic unit, is easy. Various consistency and atomicity requirements. File checkpointing, really snapshotting, is built in. No client caching, no redundancy for reliability. See also corbett:vesta2, corbett:vesta3, feitelson:pario.} } @InProceedings{corbett:vesta-di, author = {Peter F. Corbett and Dror G. Feitelson}, title = {Design and Implementation of the {Vesta} Parallel File System}, booktitle = {Proceedings of the Scalable High-Performance Computing Conference}, year = {1994}, pages = {63--70}, later = {corbett:jvesta}, keywords = {parallel I/O, multiprocessor file system, file system interface, Vesta, pario-bib}, abstract = {The Vesta Parallel file system is designed to provide parallel file access to application programs running on multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of files: a file is not a sequence of bytes, but rather it can be partitioned into multiple disjoint sequences that are accessed in parallel. The partitioning - which can also be changed dynamically - reduces the need for synchronization and coordiantion during the access. Some control over the layout of data is also provided, so the layout can be matched with the anticipated access patterns. The system is fully implemented, and is beginning to be used by application programmers. The implementation does not compromise scalability or parallelism. In fact, all data accesses are done directly to the I/O node that contains the requested data, without any indirection or access to shared metadata. There are no centralized control points in the system.}, comment = {See corbett:jvesta and corbett:vesta* for other background. Note that since this paper they have put Vesta on top of a raw disk (using 64 KB blocks) rather than on top of AIX-JFS. They describe here the structure of Vesta (2-d files, cells, subfiles, etc), the ordering of bytes within a subfile, hashing of the file name to find the file metadata, Xrefs instead of directories, caching, asynchronous I/O, prefetching, shared file pointers, concurrency control, and block-list structure. Many things, some visible to the user and some not, are new.} } @TechReport{corbett:vesta-man, author = {Peter F. Corbett and Dror G. Feitelson}, title = {Vesta File System Programmer's Reference}, year = {1994}, month = {October}, number = {Research Report RC~19898 (88058)}, institution = {IBM T.J. 
Watson Research Center}, address = {Yorktown Heights, NY 10598}, note = {Version 1.01}, keywords = {multiprocessor file system, parallel I/O, Vesta, pario-bib}, comment = {Complete user's manual of the Vesta file system. Impressive in its completeness (e.g., it has user quotas). Handy for its detailed description of the interface, but doesn't say much (of course) about the implementation.} } @InProceedings{corbett:vesta2, author = {Peter F. Corbett and Dror G. Feitelson and Jean-Pierre Prost and Sandra Johnson Baylor}, title = {Parallel Access to Files in the {Vesta} File System}, booktitle = {Proceedings of Supercomputing '93}, year = {1993}, pages = {472--481}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, later = {corbett:jvesta}, keywords = {multiprocessor file system, file checkpointing, Vesta, pario-bib}, comment = {See also corbett:jvesta, corbett:vesta, corbett:vesta3, feitelson:pario. A new abstraction and a new interface. Typical systems use transparent striping, and access modes. They believe that ``optimization requires control''. Need to be able to tell the system what you want. User-defined or default. Asynch I/O. Concurrency control. Checkpointing. Export/import to external storage. New abstraction: file is multiple sequences of records. Each process sees a logical partition of the file. Physical partition is one or more disks. Logical partition defined in terms of records. Can repartition without moving data. Rectilinear decompositions of file data to processors. They can do gather/scatter requests. Using logical partitions give system the knowledge that user's accesses are disjoint. Collective operations with consistency checks, vs. independent access. Collective open defines logical view, then synch, then check that partitions are disjoint. If not, then they have access modes to define semantics (more or less the same as other systems). Consider this a target for HPF, etc. Physical partitioning (record size and number of partitions) is defined at create time. Can they have different physical or logical partition sizes in the same file? Future: parallel pipelines, ``out-of-core'' backing store for HPF arrays, high-level operations, collective operations.} } @InProceedings{cormen:bmmc, author = {Thomas H. Cormen and Leonard F. Wisniewski}, title = {Asymptotically Tight Bounds for Performing {BMMC} Permutations on Parallel Disk Systems}, booktitle = {Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures}, year = {1993}, month = {June}, pages = {130--139}, publisher = {ACM Press}, copyright = {ACM}, later = {cormen:bmmc-tr}, keywords = {parallel I/O, algorithm, pario-bib}, comment = {Earlier version available as Dartmouth tech report PCS-TR93-193. But the most recent and complete version is Dartmouth PCS-TR94-223, cormen:bmmc-tr.} } @TechReport{cormen:bmmc-tr, author = {Thomas H. Cormen and Thomas Sundquist and Leonard F. Wisniewski}, title = {Asymptotically Tight Bounds for Performing {BMMC} Permutations on Parallel Disk Systems}, year = {1994}, month = {July}, number = {PCS-TR94-223}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, note = {Preliminary version also appeared in Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures}, earlier = {cormen:bmmc}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/96/}, keywords = {parallel I/O algorithms, pario-bib}, comment = {Supercedes cormen:bmmc.} } @Article{cormen:early-vic, author = {Thomas H. 
Cormen and Melissa Hirschl}, title = {Early Experiences in Evaluating the Parallel Disk Model with the {ViC*} Implementation}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {571--600}, publisher = {North-Holland (Elsevier Scientific)}, copyright = {North-Holland (Elsevier Scientific)}, earlier = {cormen:early-vic-tr}, keywords = {parallel I/O, parallel I/O algorithm, compiler, pario-bib}, abstract = {Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total running time to perform an out-of-core computation. This paper analyzes timing results on multiple-disk platforms for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM. \par The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, one (problem size) is a good indicator of I/O time and running time, one (memory size) is a good indicator of I/O time but not necessarily running time, and the other two (block size and number of disks) do not necessarily indicate either I/O or running time. Third, because PDM algorithms tend not to be I/O bound, using asynchronous I/O can reduce I/O wait times significantly. \par The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several underlying file systems and target machines.} } @TechReport{cormen:early-vic-tr, author = {Thomas H. Cormen and Melissa Hirschl}, title = {Early Experiences in Evaluating the Parallel Disk Model with the {ViC*} Implementation}, year = {1996}, month = {August}, number = {PCS-TR96-293}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, later = {cormen:early-vic}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/135/}, keywords = {parallel I/O, parallel I/O algorithm, compiler, pario-bib}, abstract = {Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total time to perform an out-of-core computation. This paper analyzes timing results on a uniprocessor with several disks for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM. \par The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, two (problem size and memory size) are good indicators of I/O time and running time, but the other two (block size and number of disks) are not. Third, because PDM algorithms tend not to be I/O bound, asynchronous I/O effectively hides I/O times. \par The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several parallel file systems and target machines.}, comment = {This used to be called cormen:early-vic but I renamed it because the paper will appear in parcomp.} } @Article{cormen:fft, author = {Thomas H. Cormen and David M. 
Nicol}, title = {Performing Out-of-Core {FFTs} on Parallel Disk Systems}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {5--20}, publisher = {North-Holland}, copyright = {North-Holland}, earlier = {cormen:fft-tr}, keywords = {parallel I/O, out of core, scientific computing, FFT, pario-bib}, abstract = {The fast Fourier transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the parallel disk model (PDM) of I.S. Vitter and E.A.M. Shriver (1994). When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive, or better than, those for in-core methods even when they run entirely in memory.}, comment = {see also cormen:fft2 and cormen:fft3. Part of a special issue.} } @TechReport{cormen:fft-tr, author = {Thomas H. Cormen and David M. Nicol}, title = {Performing Out-of-Core {FFTs} on Parallel Disk Systems}, year = {1996}, number = {PCS-TR96-294}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, later = {cormen:fft}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/136/}, keywords = {parallel I/O, out of core, scientific computing, FFT, pario-bib}, abstract = {The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive, or better than, those for in-core methods even when they run entirely in memory.} } @TechReport{cormen:fft2-tr, author = {Thomas H. Cormen and Jake Wegmann and David M. Nicol}, title = {Multiprocessor Out-of-Core {FFTs} with Distributed Memory and Parallel Disks}, year = {1997}, number = {PCS-TR97-303}, institution = {Dept. 
of Computer Science, Dartmouth College}, copyright = {the authors}, later = {cormen:fft3}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/143/}, keywords = {parallel I/O, out of core, scientific computing, FFT, pario-bib}, abstract = {This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86\% of the time of those that do. Moreover, the faster methods are much easier to implement.}, comment = {Extends the work of cormen:fft.} } @InProceedings{cormen:fft3, author = {Thomas H. Cormen and Jake Wegmann and David M. Nicol}, title = {Multiprocessor Out-of-Core {FFTs} with Distributed Memory and Parallel Disks}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {68--78}, publisher = {ACM Press}, copyright = {ACM}, address = {San Jose, CA}, keywords = {out of core, parallel I/O, pario-bib}, abstract = {This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations; communication that ordinarily occurs in a butterfly is folded into other data-movement operations. An analysis program shows that the two methods that use no butterfly communication usually use less communication overall than the other methods. The analysis program is fast enough that it can be invoked at run time to determine which of the four methods uses the least communication. One set of performance results on a small workstation cluster indicates that the methods without butterfly communication are approximately 9.5\% faster. Moreover, they are much easier to implement.}, comment = {They find a way to move the interprocessor communication involved in the out-of-core FFT into a single BMMC permutation between "super-levels", where each super-level involves log(M) stages of the FFT. This usually leads to less communication and to better overall performance. See also cormen:fft and cormen:fft2.} } @InProceedings{cormen:fg, author = {Thomas H. Cormen and Elena R. 
Davidson}, title = {{FG:} A framework generator for hiding latency in parallel programs running on clusters}, booktitle = {Proceedings of the 17th IASTED International Conference on Parallel and Distributed Computing and Systems}, editor = {D. A. Bader and A. A. Khokhar}, year = {2004}, month = {September}, pages = {137--144}, institution = {Dartmouth Coll, Dept Comp Sci, 6211 Sudikoff Lab, Hanover, NH 03755 USA}, publisher = {International Society for Computers and Their Applications (ISCA)}, copyright = {(c)2005 The Thomson Corporation}, address = {San Francisco, CA}, URL = {http://www.cs.dartmouth.edu/FG/}, keywords = {asynchronous I/O, pipelined I/O, pario-bib}, abstract = {FG is a programming environment for asynchronous programs that run on clusters and fit into a pipeline framework. It enables the programmer to write a series of synchronous functions and represents them as stages of an asynchronous pipeline. FG mitigates the high latency inherent in interprocessor communication and accessing the outer levels of the memory hierarchy. It overlaps separate pipeline stages that perform communication, computation, and I/O by running the stages asynchronously. Each stage maps to a thread. Buffers, whose sizes correspond to block sizes in the memory hierarchy, traverse the pipeline. FG makes such pipeline-structured parallel programs easier to write, smaller, and faster. FG offers several advantages over statically scheduled overlapping and dynamically scheduled overlapping via explicit calls to thread functions. First, it reduces coding and debugging time. Second, we find that it reduces code size by approximately 15-26\%. Third, according to experimental results, it improves performance. Compared with programs that use static scheduling, FG-generated programs run approximately 61-69\% faster on a 16-node Beowulf cluster. Compared with programs that make explicit calls for dynamically scheduled threads, FG-generated programs run slightly faster. Fourth, FG offers various design options and makes it easy for the programmer to explore different pipeline configurations.} } @InProceedings{cormen:integrate, author = {Thomas H. Cormen and David Kotz}, title = {Integrating Theory and Practice in Parallel File Systems}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {64--74}, organization = {Dartmouth Institute for Advanced Graduate Studies}, copyright = {the authors}, address = {Hanover, NH}, note = {Revised as Dartmouth PCS-TR93-188 on 9/20/94}, later = {cormen:integrate-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/cormen:integrate.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/cormen:integrate.pdf}, keywords = {parallel I/O, multiprocessor file systems, algorithm, file system interface, dfk, pario-bib}, abstract = {Several algorithms for parallel disk systems have appeared in the literature recently, and they are asymptotically optimal in terms of the number of disk accesses. Scalable systems with parallel disks must be able to run these algorithms. We present for the first time a list of capabilities that must be provided by the system to support these optimal algorithms: control over declustering, querying about the configuration, independent I/O, and turning off parity, file caching, and prefetching. We summarize recent theoretical and empirical work that justifies the need for these capabilities.
In addition, we sketch an organization for a parallel file interface with low-level primitives and higher-level operations.}, comment = {Describing the file system capabilities needed by parallel I/O algorithms to effectively use a parallel disk system. Revised as Dartmouth PCS-TR93-188 (updated).} } @TechReport{cormen:integrate-tr, author = {Thomas H. Cormen and David Kotz}, title = {Integrating Theory and Practice in Parallel File Systems}, year = {1993}, month = {March}, number = {PCS-TR93-188}, institution = {Dept. of Math and Computer Science, Dartmouth College}, copyright = {the authors}, note = {Revised 9/20/94}, earlier = {cormen:integrate}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/80/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/cormen:integrate-tr.pdf}, keywords = {parallel I/O, multiprocessor file systems, algorithm, file system interface, dfk, pario-bib}, abstract = {Several algorithms for parallel disk systems have appeared in the literature recently, and they are asymptotically optimal in terms of the number of disk accesses. Scalable systems with parallel disks must be able to run these algorithms. We present a list of capabilities that must be provided by the system to support these optimal algorithms: control over declustering, querying about the configuration, independent I/O, turning off file caching and prefetching, and bypassing parity. We summarize recent theoretical and empirical work that justifies the need for these capabilities.}, comment = {Describing the file system capabilities needed by parallel I/O algorithms to effectively use a parallel disk system. Cite cormen:integrate.} } @Article{cormen:jbmmc, author = {T.~H. Cormen and T. Sundquist and L.~F. Wisniewski}, title = {Asymptotically tight bounds for performing {BMMC} permutations on parallel disk systems}, journal = {SIAM Journal on Computing}, year = {1998}, volume = {28}, number = {1}, pages = {105--136}, publisher = {SIAM}, copyright = {SIAM}, keywords = {parallel I/O, parallel I/O algorithms, pario-bib}, abstract = {This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by J.S. Vitter and E.A.M. Shriver (1994). A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data. The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations. 
Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.} } @Article{cormen:oocfft, author = {Thomas H. Cormen and David M. Nicol}, title = {Out-of-Core {FFTs} with Parallel Disks}, journal = {ACM SIGMETRICS Performance Evaluation Review}, year = {1997}, month = {December}, volume = {25}, number = {3}, pages = {3--12}, publisher = {ACM Press}, copyright = {ACM}, keywords = {scientific computing, out-of-core computation, parallel I/O, pario-bib}, comment = {Part of a special issue on parallel and distributed I/O.} } @Article{cormen:permute, author = {Thomas H. Cormen}, title = {Fast Permuting on Disk Arrays}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January and February}, volume = {17}, number = {1--2}, pages = {41--57}, publisher = {Academic Press}, copyright = {Academic Press}, keywords = {parallel I/O algorithm, pario-bib}, comment = {See also cormen:thesis.} } @PhdThesis{cormen:thesis, author = {Thomas H. Cormen}, title = {Virtual Memory for Data-Parallel Computing}, year = {1992}, school = {Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology}, copyright = {Thomas H. Cormen}, keywords = {parallel I/O, algorithm, pario-bib}, comment = {Lots of algorithms for out-of-core permutation problems. See also cormen:permute, cormen:integrate.} } @TechReport{cormen:vic, author = {Thomas H. Cormen and Alex Colvin}, title = {{ViC*}: A Preprocessor for Virtual-Memory {C*}}, year = {1994}, month = {November}, number = {PCS-TR94-243}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/109/}, keywords = {compiler, out-of-core computation, parallel I/O, pario-bib}, abstract = {This paper describes the functionality of ViC*, a compiler-like preprocessor for out-of-core C*. The input to ViC* is a C* program but with certain shapes declared \verb`outofcore`, which means that all parallel variables of these shapes reside on disk. The output is a standard C* program with the appropriate I/O and library calls added for efficient access to out-of-core parallel variables.} } @Article{correa:out-of-core, author = {Wagner T. Corr\^ea and James T. Klosowski and Cl\'audio T. Silva}, title = {Out-of-core sort-first parallel rendering for cluster-based tiled displays}, journal = {Parallel Computing}, booktitle = {Eurographics Workshop on Parallel Graphics and Visualization, Blaubeuren, Germany, 2002}, year = {2003}, month = {March}, volume = {29}, number = {3}, pages = {325--338}, institution = {Princeton Univ, Princeton, NJ 08540 USA; IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA; Oregon Hlth Sci Univ, Beaverton, OR 97006 USA}, publisher = {Netherlands : Elsevier, 2003}, copyright = {(c)2004 IEE; Institute for Scientific Information, Inc.}, URL = {http://dx.doi.org/10.1016/S0167-8191(02)00249-1}, keywords = {cluster based tiled displays, out-of-core rendering, sort first parallel rendering, pario-bib}, abstract = {We present a sort-first parallel system for out-of-core rendering of large models on cluster-based tiled displays.
The system renders high-resolution images of large models at interactive frame rates using off-the-shelf PCs with small memory. Given a model, we use an out-of-core preprocessing algorithm to build an on-disk hierarchical representation for the model. At run time, each PC renders the image for a display tile, using an out-of-core rendering approach that employs multiple threads to overlap rendering, visibility computation, and disk operations. The system can operate in approximate mode for real-time rendering, or in conservative mode for rendering with guaranteed accuracy. Running our system in approximate mode on a cluster of 16 PCs each with 512 MB of main memory, we are able to render 12-megapixel images of a 13-million-triangle model with 99.3\% of accuracy at 10.8 frames per second. Rendering such a large model at high resolutions and interactive frame rates would typically require expensive high-end graphics hardware. Our results show that a cluster of inexpensive PCs is an attractive alternative to those high-end systems.} } @InCollection{cortes:bcooperative, author = {Toni Cortes and Sergi Girona and Jes\'us Labarta}, title = {Design Issues of a Cooperative Cache with no Coherence Problems}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {18}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {259--270}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {cortes:cooperative}, URL = {http://www.buyya.com/superstorage/}, keywords = {cooperative caching, distributed file system, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of cortes:cooperative.} } @InCollection{cortes:bookchap, author = {Toni Cortes}, title = {Software {RAID} and Parallel Filesystems}, booktitle = {High Performance Cluster Computing}, editor = {Rajkumar Buyya}, year = {1999}, pages = {463--496}, publisher = {Prentice Hall PTR}, keywords = {parallel file system, RAID, cluster computing, parallel I/O, pario-bib} } @InProceedings{cortes:cooperative, author = {Toni Cortes and Sergi Girona and Jes\'us Labarta}, title = {Design Issues of a Cooperative Cache with no Coherence Problems}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {37--46}, publisher = {ACM Press}, address = {San Jose, CA}, later = {cortes:bcooperative}, URL = {http://doi.acm.org/10.1145/266220.266224}, keywords = {cooperative caching, distributed file system, parallel I/O, pario-bib}, abstract = {In this paper, we examine some of the important problems observed in the design of cooperative caches. Solutions to the coherence, load-balancing and fault-tolerance problems are presented. These solutions have been implemented as a part of PAFS, a parallel/distributed file system, and its performance has been compared to the one achieved by xFS. Using the comparison results, we have observed that the proposed ideas not only solve the main problems of cooperative caches, but also increase the overall system performance. Although the solutions presented in this paper were targeted to a parallel machine, reasonably good results have also been obtained for networks of workstations.}, comment = {They make the claim that it is better not to replicate data into local client caches; rather, it is better to simply make remote read and write requests to the cached block in whatever memory it may be.
That reduces the overhead (space and time) of replication and coherency, and leads to better performance. They also present a range of parity-based fault-tolerance mechanisms, and a load-balancing technique that reassigns cache buffers to cache-manager processes.} } @Article{cortes:hetero2, author = {Toni Cortes and Jes\'us Labarta}, title = {Taking advantage of heterogeneity in disk arrays}, journal = {Journal of Parallel and Distributed Computing}, year = {2003}, month = {April}, volume = {63}, number = {4}, pages = {448--464}, institution = {Univ Politecn Catalunya, Dept Arquitectura Comp, Campus Nord, C6-E202, C Jordi Girona 1-3, ES-08034 Barcelona, Spain}, publisher = {Academic Press Inc Elsevier Science}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, keywords = {AdaptRaid, block distribution, heterogeneity, RAID, parallel I/O, pario-bib}, abstract = {Disk arrays, or RAIDs, have become the solution to increase the capacity and bandwidth of most storage systems, but their usage has some limitations because all the disks in the array have to be equal. Nowadays, assuming a homogeneous set of disks to build an array is becoming a not very realistic assumption in many environments, especially in low-cost clusters of workstations. It is difficult to find a disk with the same characteristics as the ones in the array and replacing or adding new disks breaks the homogeneity. In this paper, we propose two block-distribution algorithms (one for RAID0 and an extension for RAID5) that can be used to build disk arrays from a heterogeneous set of disks. We also show that arrays using this algorithm are able to serve many more disk requests per second than when blocks are distributed assuming that all disks have the lowest common speed, which is the solution currently being used. (C) 2003 Elsevier Science (USA). All rights reserved.} } @InProceedings{cortes:heterogeneity, author = {T. Cortes and J. Labarta}, title = {Extending Heterogeneity to {RAID} level 5}, booktitle = {Proceedings of the 2001 USENIX Technical Conference}, year = {2001}, month = {June}, pages = {119--132}, publisher = {USENIX Association}, address = {Boston}, URL = {http://www.usenix.org/publications/library/proceedings/usenix01/cortes.html}, keywords = {parallel I/O, RAID, pario-bib}, abstract = {RAIDs level 5 are one of the most widely used kinds of disk array, but their usage has some limitations because all the disks in the array have to be equal. Nowadays, assuming a homogeneous set of disks to build an array is becoming a not very realistic assumption in many environments, especially in low-cost clusters of workstations. It is difficult to find a disk with the same characteristics as the ones in the array and replacing or adding new disks breaks the homogeneity. In this paper, we propose a block-distribution algorithm that can be used to build disk arrays from a heterogeneous set of disks. We also show that arrays using this algorithm are able to serve many more disk requests per second than when blocks are distributed assuming that all disks have the lowest common speed, which is the solution currently being used.}, comment = {The web page for the project is http://people.ac.upc.es/toni/AdaptRaid.html} } @InProceedings{cortes:heterogeneous, author = {T. Cortes and J.
Labarta}, title = {A Case for Heterogeneous Disk Arrays}, booktitle = {Proceedings of the IEEE International Conference on Cluster Computing (Cluster'2000)}, year = {2000}, month = {November}, pages = {319--325}, publisher = {IEEE Computer Society Press}, keywords = {disk array, parallel I/O, pario-bib}, abstract = {Heterogeneous disk arrays are becoming a common configuration in many sites and especially in storage area networks (SAN). As new disks have different characteristics than old ones, adding new disks or replacing old ones ends up in a heterogeneous disk array. Current solutions to this kind of arrays do not take advantage of the improved characteristics of the new disks. In this paper, we present a block-distribution algorithm that takes advantage of these new characteristics and thus improves the performance and capacity of heterogeneous disk arrays compared to current solutions.}, comment = {The technical report associated with this paper can be found at ftp://ftp.ac.upc.es/pub/reports/DAC/2000/UPC-DAC-2000-76.ps.Z} } @InProceedings{cortes:hraid, author = {Toni Cortes and Jes\'us Labarta}, title = {{HRaid}: A Flexible Storage-system Simulator}, booktitle = {Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications}, year = {1999}, month = {June}, pages = {772--778}, publisher = {CSREA Press}, URL = {ftp://ftp.ac.upc.es/pub/reports/DAC/1999/UPC-DAC-1999-14.ps.Z}, keywords = {simulation, RAID, disk array, storage system, heterogeneous system, parallel I/O, pario-bib}, abstract = {Clusters of workstations are becoming a quite popular platform to run high-performance applications. This fact has stressed the need of high-performance storage systems for this kind of environments. In order to design such systems, we need adequate tools, which should be flexible enough to model a cluster of workstations. Currently available simulators do not allow heterogeneity (several kinds of disks), hierarchies or resource sharing (among others), which are quite common in clusters. To fill this gap, we have designed and implemented HRaid, which is a very flexible and easy to use storage-system simulator. In this paper, we present this simulator, its main abstractions and some simple examples of how it can be used.} } @InProceedings{cortes:lessons, author = {Toni Cortes}, title = {Parallel {I/O}: lessons learnt in the last 20 years}, booktitle = {Proceedings of the 2004 IEEE International Conference on Cluster Computing}, year = {2004}, month = {September}, publisher = {Piscataway, NJ, USA : IEEE, 2004}, copyright = {(c)2005 IEE}, address = {San Diego, CA}, keywords = {tutorial, parallel I/O overview, pario-bib}, abstract = {Summary form only given. After these two decades, it is now a good time to go through all the done work and try to learn the important lessons all these parallel I/O initiatives have taught us. This paper aims at giving this global overview. The focus is not on commercial/academic systems/prototypes, but on the concepts that lay behind them. These concepts have normally been applied at different levels, and thus, such an overview can be of interest to many people ranging from the hardware design to the application implementation.
Some of the most important concepts that are discussed are, among others, data placement (RAIDs, 2D and 3D files, ...), network architectures for parallel I/O (Network attached devices, SAN, ...), parallel caching and prefetching (cooperative caching, Informed caching and prefetching, ...), and interfaces (collective I/O, data distribution interfaces, ...).}, comment = {Tutorial given at Cluster 2004.} } @InProceedings{cortes:paca, author = {Toni Cortes and Sergi Girona and Jes\'us Labarta}, title = {{PACA}: A Cooperative File System Cache for Parallel Machines}, booktitle = {Proceedings of the 2nd International Euro-Par Conference}, year = {1996}, month = {August}, pages = {I:477--486}, earlier = {cortes:paca-tr}, keywords = {file caching, multiprocessor file system, cooperative caching, parallel I/O, pario-bib}, abstract = {A new cooperative caching mechanism, PACA, along with a caching algorithm, LRU-Interleaved, and an aggressive prefetching algorithm, Full-File-On-Open, are presented. The caching algorithm is especially targeted to parallel machines running a microkernel-based operating system. It avoids the cache coherence problem with no loss in performance. Comparing our algorithm with N-Chance Forwarding, in the above environment, better results have been obtained by LRU-Interleaved. We also evaluate an aggressive prefetching algorithm that highly increases read performance taking advantage of the huge caches cooperative caching offers.}, comment = {Contact toni@ac.upc.es. See also the longer version of the paper, cortes:paca-tr.} } @TechReport{cortes:paca-tr, author = {Toni Cortes and Sergi Girona and Jes\'us Labarta}, title = {{PACA}: A Cooperative File System Cache for Parallel Machines}, year = {1996}, number = {96-07}, institution = {UPC-CEPBA}, later = {cortes:paca}, URL = {ftp://ftp.ac.upc.es/pub/reports/CEPBA/1996/UPC-CEPBA-1996-7.ps.Z}, keywords = {file caching, multiprocessor file system, cooperative caching, parallel I/O, pario-bib}, comment = {See cortes:paca.} } @InProceedings{cortes:pafs, author = {Toni Cortes and Sergi Girona and Jes\'us Labarta}, title = {Avoiding the Cache-Coherence Problem in a Parallel/Distributed File System}, booktitle = {Proceedings of High-Performance Computing and Networking}, year = {1997}, month = {April}, pages = {860--869}, later = {cortes:pafs2}, keywords = {file caching, multiprocessor file system, cooperative caching, cache coherence, parallel I/O, pario-bib}, abstract = {In this paper we describe PAFS, a new parallel/distributed file system. Within the whole file system, special interest is placed on the caching mechanism. We present a cooperative cache that has the advantages of cooperation and avoids the problems derived from the coherence mechanisms. Furthermore, this has been achieved with a reasonable gain in performance.
In order to show the obtained performance, we present a comparison between PAFS and xFS (a file system that also implements a cooperative cache).}, comment = {Contact toni@ac.upc.es.} } @TechReport{cortes:pafs2, author = {Toni Cortes and Sergi Girona and Jes\'us Labarta}, title = {Avoiding the Cache-Coherence Problem in a Parallel/Distributed File System}, year = {1997}, month = {May}, number = {UPC-CEPBA-1996-13}, institution = {UPC-CEPBA}, earlier = {cortes:pafs}, URL = {ftp://ftp.ac.upc.es/pub/reports/CEPBA/1996/UPC-CEPBA-1996-13.ps.Z}, keywords = {file caching, multiprocessor file system, cooperative caching, cache coherence, parallel I/O, pario-bib}, abstract = {In this paper we present PAFS, a new parallel/distributed file system. Within the whole file system, special interest is placed on the caching and prefetching mechanisms. We present a cooperative cache that avoids the coherence problem while it continues to be highly scalable and achieves very good performance. We also present an aggressive prefetching algorithm that allows full utilization of the big caches offered by the cooperative cache mechanism. All the results presented in this paper have been obtained through simulation using the Sprite workload.}, comment = {A longer, more detailed version of cortes:pafs.} } @InProceedings{cortes:prefetch, author = {T. Cortes and J. Labarta}, title = {Linear Aggressive Prefetching: A Way to Increase the Performance of Cooperative Caches}, booktitle = {Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing}, year = {1999}, month = {April}, pages = {45--54}, address = {San Juan, Puerto Rico}, keywords = {parallel I/O, file access pattern, prefetching, caching, simulation, pario-bib}, abstract = {Cooperative caches offer huge amounts of caching memory that is not always used as well as it could be. We might find blocks in the cache that have not been requested for many hours. These blocks will hardly improve the performance of the system while the buffers they occupy could be better used to speed-up the I/O operations. In this paper, we present a family of simple prefetching algorithms that increase the file-system performance significantly. Furthermore, we also present a way to make any simple prefetching algorithm into an aggressive one that controls its aggressiveness not to flood the cache unnecessarily. All these algorithms and mechanisms have proven to increase the performance of two state-of-the-art parallel/distributed file systems: PAFS and xFS.}, comment = {They present algorithms for "linear aggressive prefetching" for systems using a cooperative cache. Two prediction schemes are used: an OBA (one block ahead) and IS_PPM (interval and size prediction by partial match). The aggressive prefetch algorithm continuously prefetches data until a mis-prediction occurs. When a mis-prediction occurs, they realize that they were on the wrong path and start prefetching again from the mis-predicted block. To limit the aggressiveness of the prefetching, they only allow one block from each file to be prefetched at a time. If a single application is running, this forces parallel reads to utilize only one disk at a time. They claim, however, that when many files are being accessed they achieve good disk utilization. They implemented the prefetching algorithms on the xFS \cite{anderson:serverless} and PAFS \cite{cortes:pafs} file systems.
They used a trace-driven simulator DIMEMAS \cite{labarta:dip} to obtain performance results for portions of the CHARISMA and Sprite workloads. The results show that using aggressive prefetching does not usually load the system more than a system with no prefetching, and sometimes, it even lowers the disk traffic.} } @PhdThesis{cortes:thesis, author = {Toni Cortes}, title = {Cooperative Caching and Prefetching in Parallel/Distributed File Systems}, year = {1997}, school = {UPC: Universitat Polit\`ecnica de Catalunya}, address = {Barcelona, Spain}, URL = {http://www.ac.upc.es/homes/toni/thesis.html}, keywords = {parallel I/O, file access pattern, prefetching, caching, pario-bib} } @InProceedings{courtright:backward, author = {William V. {Courtright II} and Garth A. Gibson}, title = {Backward Error Recovery in Redundant Disk Arrays}, booktitle = {Proceedings of the Twentieth International Conference for the Resource Management and Performance Evaluation of Enterprise Computing Systems (CMG)}, year = {1994}, month = {December}, pages = {63--74}, earlier = {courtright:backward-tr}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/RAID/CMG94paper.ps}, keywords = {parallel I/O, disk array, RAID, redundancy, reliability, recovery, pario-bib}, abstract = {Redundant disk arrays are single fault tolerant, incorporating a layer of error handling not found in nonredundant disk systems. Recovery from these errors is complex, due in part to the large number of erroneous states the system may reach. The established approach to error recovery in disk systems is to transition directly from an erroneous state to completion. This technique, known as forward error recovery, relies upon the context in which an error occurs to determine the steps required to reach completion, which implies forward error recovery is design specific. Forward error recovery requires the enumeration of all erroneous states the system may reach and the construction of a forward path from each erroneous state. We propose a method of error recovery which does not rely upon the enumeration of erroneous states or the context in which errors occur. When an error is encountered, we advocate mechanized recovery to an error-free state from which an operation may be retried. Using a form of backward error recovery, we are able to manage the complexity of error recovery in redundant disk arrays without sacrificing performance.}, comment = {Also available in HTML format at http://www.cs.cmu.edu/Web/Groups/PDL/HTML-Papers/CMG94/c.fm.html.} } @TechReport{courtright:backward-tr, author = {William V. {Courtright II} and Garth A. Gibson}, title = {Backward Error Recovery in Redundant Disk Arrays}, year = {1994}, month = {September}, number = {CMU-CS-94-193}, institution = {Carnegie Mellon University}, later = {courtright:backward}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/RAID/CMU-CS-94-193.ps}, keywords = {parallel I/O, disk array, RAID, redundancy, reliability, recovery, pario-bib} } @InProceedings{courtright:raidframe, author = {William V. {Courtright II} and Garth A. 
Gibson and Mark Holland and Jim Zelenka}, title = {{RAIDframe}: rapid prototyping for disk arrays}, booktitle = {Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1996}, month = {May}, pages = {268--269}, publisher = {ACM Press}, address = {Philadelphia, PA}, note = {Poster paper.}, earlier = {gibson:raidframe-tr}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/RAID/Sigmetrics96.ps}, keywords = {parallel I/O, RAID, disk array, reliability, simulation, pario-bib}, comment = {See expanded version gibson:raidframe-tr.} } @InProceedings{coyne:hpss, author = {Robert A. Coyne and Harry Hulen and Richard Watson}, title = {The High Performance Storage System}, booktitle = {Proceedings of Supercomputing '93}, year = {1993}, pages = {83--92}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, keywords = {parallel I/O, file system, network, pario-bib}, comment = {See also coyne:storage.} } @Misc{coyne:storage, author = {Robert A. Coyne and Harry Hulen and Richard Watson}, title = {Storage Systems for National Information Assets}, year = {1993}, note = {Publication status unknown}, keywords = {parallel I/O, file system, network, pario-bib}, comment = {See also coyne:hpss. They describe the National Storage Laboratory at LLNL. Collaboration with many companies. The idea is to build a combined storage system from many disk and tape components that is networked to supercomputers. The philosophy is to separate control and data network traffic, so that the overall control can be managed by a (relatively) small computer, without the same computer needing to pump all of the data through its CPU. The data would go directly from the devices to the client supercomputer. They also want to support multiple hierarchies of data storage, so that new technologies can be inserted without disrupting existing hierarchies. Access interface is layered so that high-level abstractions can be provided as well as low-level control for those who need it.} } @InProceedings{cozette:read2, author = {Olivier Cozette and Cyril Randriamaro and Gil Utard}, title = {{READ^2}: Put disks at network level}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {698--704}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190698abs.htm}, keywords = {parallel I/O, pario-bib}, abstract = {Grand challenge applications have to process large amounts of data, and then require high performance IO systems. Cluster computing is a good alternative to proprietary system for building cost effective IO intensive platform: some cluster architectures won sorting benchmark (MinuteSort, Datamation)! Recent advances in IO component technologies (disk, controller and network) let us expect higher IO performance for data intensive applications on cluster. The counterpart of this evolution is that much stress is put on the different buses (memory, IO) of each node which cannot be scaled. In this paper we investigate a strategy we called READ2 (Remote Efficient Access to Distant Device) to reduce the stress.
With READ2 any cluster node accesses directly to remote disk: the remote processor and the remote memory are removed from the control and data path: Inputs/Outputs don't interfere with the host processor and the host memory activity. With READ2 strategy, a cluster can be considered as a shared disk architecture instead of a shared nothing one. This papers describes an implementation of READ^2 on Myrinet Networks. First experimental results show IO performance improvement.} } @InProceedings{crandall:iochar, author = {Phyllis E. Crandall and Ruth A. Aydt and Andrew A. Chien and Daniel A. Reed}, title = {Input/Output Characteristics of Scalable Parallel Applications}, booktitle = {Proceedings of Supercomputing '95}, year = {1995}, month = {December}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://doi.acm.org/10.1145/224170.224396}, keywords = {file access pattern, file system workload, workload characterization, parallel I/O, pario-bib}, abstract = {Rapid increases in computing and communication performance are exacerbating the long-standing problem of performance-limited input/output. Indeed, for many otherwise scalable parallel applications, input/output is emerging as a major performance bottleneck. The design of scalable input/output systems depends critically on the input/output requirements and access patterns for this emerging class of large-scale parallel applications. However, hard data on the behavior of such applications is only now becoming available. In this paper, we describe the input/output requirements of three scalable parallel applications (electron scattering, terrain rendering, and quantum chemistry) on the Intel Paragon XP/S. As part of an ongoing parallel input/output characterization effort, we used instrumented versions of the application codes to capture and analyze input/output volume, request size distributions, and temporal request structure. Because complete traces of individual application input/output requests were captured, in-depth, off-line analyses were possible. In addition, we conducted informal interviews of the application developers to understand the relation between the codes' current and desired input/output structure. The results of our studies show a wide variety of temporal and spatial access patterns, including highly read-intensive and write-intensive phases, extremely large and extremely small request sizes, and both sequential and highly irregular access patterns. We conclude with a discussion of the broad spectrum of access patterns and their profound implications for parallel file caching and prefetching schemes.}, comment = {They use the Pablo instrumentation and analysis tools to instrument three scalable applications that use heavy I/O: electron scattering, terrain rendering, and quantum chemistry. They look at the volume of data moved, the timing of I/O, and the periodic nature of I/O. They do a little bit with the access patterns of data within each file. They found a HUGE variation in request sizes, amount of I/O, number of files, and so forth. Their primary conclusion is thus that file systems should be adaptable to different access patterns, preferably under control of the application. Note proceedings only available on CD-ROM or WWW.} } @InCollection{crauser:segment, author = {A. Crauser and P. Ferragina and K. Mehlhorn and U. Meyer and E. A. 
Ramos}, title = {{I/O}-Optimal Computation of Segment Intersections}, booktitle = {External Memory Algorithms and Visualization}, editor = {James Abello and Jeffrey Scott Vitter}, crossref = {abello:dimacs}, year = {1999}, series = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science}, pages = {131--138}, publisher = {American Mathematical Society Press}, address = {Providence, RI}, keywords = {parallel I/O, out-of-core algorithm, computational geometry, data structure, pario-bib}, comment = {See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.} } @Misc{cray:pario2, key = {Cray90}, author = {Cray Research}, title = {{DS-41} Disk Subsystem}, year = {1990}, note = {Sales literature number MCFS-4-0790}, keywords = {parallel I/O, disk architecture, disk array, pario-bib}, comment = {Glossy from Cray describing their new disk subsystem: up to four controllers and up to four ``drives'', each of which actually has four spindles. Thus, a full subsystem has 16 disks. Each drive or controller sustains 9.6 MBytes/sec, for a total of 38.4 MBytes/sec. Each drive has 4.8 GBytes, for a total of 19.2 GBytes. Access time per drive is 2--46.6 msec, average 24 msec. They don't say how the 4 spindles within a drive are controlled or arranged.} } @Unpublished{crockett:manual, author = {Thomas W. Crockett}, title = {Specification of the Operating System Interface for Parallel File Organizations}, year = {1988}, note = {Publication status unknown (ICASE technical report)}, keywords = {parallel I/O, parallel file system, pario-bib}, comment = {Man pages for his Flex version of the file interface. See crockett:par-files.} } @InProceedings{crockett:par-files, author = {Thomas W. Crockett}, title = {File Concepts for Parallel {I/O}}, booktitle = {Proceedings of Supercomputing '89}, year = {1989}, pages = {574--579}, keywords = {parallel I/O, file access pattern, parallel file system, pario-bib}, comment = {Two views of a file: global (for sequential programs) and internal (for parallel programs). Standardized forms for these views, for long-lived files. Temp files have specialized forms. The access types are sequential, partitioned, interleaved, and self-scheduled, plus global random and partitioned random. He relates these to their best storage patterns. No mention of prefetching. Buffer cache only needed for direct (random) access. The application must specify the access pattern desired.} } @Article{csa-io, author = {T. J. M.}, title = {Now: Parallel storage to match parallel {CPU} power}, journal = {Electronics}, year = {1988}, month = {December}, volume = {61}, number = {12}, pages = {112}, keywords = {parallel I/O, disk array, pario-bib} } @Article{cypher:jrequire, author = {Robert Cypher and Alex Ho and Smaragda Konstantinidou and Paul Messina}, title = {A Quantitative Study of Parallel Scientific Applications with Explicit Communication}, journal = {Journal of Supercomputing}, year = {1996}, month = {March}, volume = {10}, number = {1}, pages = {5--24}, earlier = {cypher:require}, keywords = {workload characterization, scientific computing, parallel programming, message passing, pario-bib}, comment = {Some mention of I/O.} } @InProceedings{cypher:require, author = {R. Cypher and A. Ho and S. Konstantinidou and P.
Messina}, title = {Architectural Requirements of Parallel Scientific Applications with Explicit Communication}, booktitle = {Proceedings of the 20th Annual International Symposium on Computer Architecture}, year = {1993}, pages = {2--13}, later = {cypher:jrequire}, keywords = {workload characterization, scientific computing, parallel programming, message passing, pario-bib}, comment = {Some mention of I/O, though only in a limited way. Average 1207B/MFlop. Some of the applications do I/O throughout their run (2400B/MFlop avg), while others only do I/O at the beginning or end (14B/MFlop avg). But I/O is bursty, so larger bandwidths are suggested. The applications are parallel programs running on Intel Delta, nCUBE/1, nCUBE/2, and are in C, FORTRAN, or both.} } @Article{davis:rle, author = {G. Davis and L. Lau and R. Young and F. Duncalfe and L. Brebber}, title = {Parallel Run-Length Encoding {(RLE)} Compression---Reducing {I/O} in Dynamic Environmental Simulations}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Winter}, volume = {12}, number = {4}, pages = {396--410}, note = {In a Special Issue on I/O in Parallel Applications}, keywords = {parallel I/O application, compression, pario-bib}, abstract = {Dynamic simulations based on time-varying inputs are extremely I/O intensive. This is shown by industrial applications generating environmental projections based on seasonal-to-interannual climate forecasts which have a compute to data-access ratio of O(n) leading to significant performance degradation. Exploitation of compression techniques such as Run-Length-Encoding (RLE) significantly reduces the I/O bottleneck and storage requirements. Unfortunately, traditional RLE algorithms do not perform well in a parallel-vector platform such as the Cray architecture. This paper describes the design and implementation of a new RLE algorithm based on data chunking and packing that exploits the Cray gather-scatter vector hardware and multiple processors. This innovative approach reduces I/O and file storage requirements on average by an order of magnitude. Data intensive applications such as the integration of environmental and global climate models now become practical in a realistic time-frame.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @TechReport{dazevedo:edonio, author = {E.~F. D'Azevedo and C.~H. Romine}, title = {{EDONIO}: Extended distributed object network {I/O} library}, year = {1995}, number = {ORNL/TM-12934}, institution = {Oak Ridge National Laboratory}, keywords = {parallel I/O, pario-bib} } @Article{debenedictis:modular, author = {Erik P. 
DeBenedictis and Juan Miguel {del Rosario}}, title = {Modular Scalable {I/O}}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January and February}, volume = {17}, number = {1--2}, pages = {122--128}, publisher = {Academic Press}, keywords = {parallel I/O, MIMD, pario-bib}, comment = {Journalized version of debenedictis:pario, debenedictis:ncube, and delrosario:nCUBE.} } @InProceedings{debenedictis:ncube, author = {Erik DeBenedictis and Juan Miguel del Rosario}, title = {{nCUBE} Parallel {I/O} Software}, booktitle = {Proceedings of the Eleventh Annual IEEE International Phoenix Conference on Computers and Communications}, year = {1992}, month = {April}, pages = {0117--0124}, publisher = {IEEE Computer Society Press}, address = {Scottsdale, AZ}, keywords = {parallel file system, parallel I/O, pario-bib}, comment = {Interesting paper. Describes their mechanism for mapping I/O so that the file system knows both the mapping of a data structure into memory and on the disks, so that it can do the permutation and send the right data to the right disk, and back again. Interesting Unix-compatible interface. Needs to be extended to handle complex formats.} } @InProceedings{debenedictis:pario, author = {Erik DeBenedictis and Peter Madams}, title = {{nCUBE's} Parallel {I/O} with {Unix} Capability}, booktitle = {Proceedings of the Sixth Annual Distributed-Memory Computer Conference}, year = {1991}, pages = {270--277}, keywords = {parallel I/O, multiprocessor file system, file system interface, pario-bib}, comment = {Looks like they give the byte-level mapping, then do normal reads and writes; mapping routes the data to and from the correct place. But it does let you intermix comp with I/O. Elegant concept. Nice interface. Works best for cases where 1) data layout known in advance, data format is known, and mapping is regular enough for easy specification. I think that irregular or unknown mappings could still be done with a flat mapping.} } @Article{debenedictis:scalable-unix, author = {Erik P. DeBenedictis and Stephen C. Johnson}, title = {Extending {Unix} for Scalable Computing}, journal = {IEEE Computer}, year = {1993}, month = {November}, volume = {26}, number = {11}, pages = {43--53}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, Unix, pario-bib}, comment = {A more polished version of his other papers with del Rosario. The mapping-based mechanism is released in nCUBE software 3.0. It does support shared file pointers for self-scheduled I/O, as well as support for variable-length records, and asynchronous I/O (although the primary mechanism is for synchronous, i.e., SPMD, I/O). The basic idea of scalable pipes (between programs, devices, etc.) with mappings that determine routings to units seems like a good idea.} } @InProceedings{debergalis:dafs, author = {Matt DeBergalis and Peter Corbett and Steve Kleiman and Arthur Lent and Dave Noveck and Tom Talpey and Mark Wittle}, title = {The Direct Access File System}, booktitle = {Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies}, year = {2003}, month = {April}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast03/tech/debergalis.html}, keywords = {direct access file system, dafs, remote dma, pario-bib}, abstract = {The Direct Access File System (DAFS) is a new, fast, and lightweight remote file system protocol. 
DAFS targets the data center by addressing the performance and functional needs of clusters of application servers. We call this the local file sharing environment. File access performance is improved by utilizing Direct Access Transports, such as InfiniBand, Remote Direct Data Placement, and the Virtual Interface Architecture. DAFS also enhances file sharing semantics compared to prior network file system protocols. Applications using DAFS through a user-space I/O library can bypass operating system overhead, further improving performance. We present performance measurements of an IP-based DAFS network, demonstrating the DAFS protocol's lower client CPU requirements over commodity Gigabit Ethernet. We also provide the first multiprocessor scaling results for a well-known application (GNU gzip) converted to use DAFS.} } @Article{delrosario:ncube, author = {Juan Miguel del Rosario}, title = {High Performance Parallel {I/O} on the {nCUBE}~2}, journal = {Transactions of the Institute of Electronics, Information and Communications Engineers}, year = {1992}, month = {August}, volume = {J75D-I}, number = {8}, pages = {626--636}, keywords = {parallel I/O, parallel file system, pario-bib}, comment = {More detail on the mapping functions, and more flexible mapping functions (can be user specified, or some from a library). Striped disks, parallel pipes, graphics, and HIPPI supported.} } @Article{delrosario:prospects, author = {Juan Miguel {del Rosario} and Alok Choudhary}, title = {High Performance {I/O} for Parallel Computers: Problems and Prospects}, journal = {IEEE Computer}, year = {1994}, month = {March}, volume = {27}, number = {3}, pages = {59--68}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, survey, pario-bib}, comment = {Nice summary of grand-challenge and other applications, and their I/O needs. Points out the need for quantitative studies of workloads. Comments on architectures, eg, the advent of per-node disk devices. OS problems include communication latency, data decomposition, interface, prefetching and caching, and checkpointing. Runtime system and compilers are important, particularly in reference to data-mapping and re-mapping (see delrosario:two-phase). Persistent object stores and networking are mentioned briefly.} } @InProceedings{delrosario:two-phase, author = {Juan Miguel {del Rosario} and Rajesh Bordawekar and Alok Choudhary}, title = {Improved Parallel {I/O} via a Two-Phase Run-time Access Strategy}, booktitle = {Proceedings of the IPPS~'93 Workshop on Input/Output in Parallel Computer Systems}, year = {1993}, pages = {56--70}, address = {Newport Beach, CA}, note = {Also published in Computer Architecture News 21(5), December 1993, pages 31--38}, earlier = {delrosario:two-phase-tr}, keywords = {parallel I/O, multiprocessor file system, pario-bib} } @TechReport{delrosario:two-phase-tr, author = {Juan Miguel del Rosario and Rajesh Bordawekar and Alok Choudhary}, title = {Improving Parallel {I/O} Performance using a Two-Phase Access Strategy}, year = {1993}, number = {SCCS--406}, institution = {NPAC at Syracuse University}, later = {delrosario:two-phase}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {They show performance measurements of various data distributions on an nCUBE and the Touchstone Delta, for reading matrix from a column-major file striped across disks, into some distribution across procs. 
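(An aside from me, not from the paper: a minimal sketch of the two-phase idea in MPI-style C. Phase one has each process read the slab that conforms to the file layout with one large contiguous request; phase two redistributes the data in memory to the distribution the computation actually wants. The function and variable names, the square-matrix layout, and the use of POSIX pread plus MPI_Alltoall are my own illustrative assumptions, not the authors' implementation.
    /* Sketch only: read a column-block slab of an n-by-n matrix of doubles in
     * file order, then exchange pieces so each process ends up with the
     * distribution the computation wants.  The final local permutation of the
     * received pieces is omitted for brevity. */
    #include <mpi.h>
    #include <unistd.h>
    #include <stdlib.h>

    void two_phase_read(int fd, double *wanted, int n, int rank, int nprocs)
    {
        size_t slab = (size_t)n * (n / nprocs);     /* assumes nprocs divides n */
        double *conforming = malloc(slab * sizeof(double));

        /* Phase 1: one large contiguous read, matching the file layout. */
        pread(fd, conforming, slab * sizeof(double),
              (off_t)rank * slab * sizeof(double));

        /* Phase 2: in-memory redistribution via collective communication,
         * replacing many small, poorly aligned disk requests. */
        MPI_Alltoall(conforming, (int)(slab / nprocs), MPI_DOUBLE,
                     wanted,     (int)(slab / nprocs), MPI_DOUBLE,
                     MPI_COMM_WORLD);
        free(conforming);
    }
The point is that the disks see only big, conforming requests, while the interconnect, which handles small exchanges far better than the I/O system does, absorbs the redistribution cost.)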
Distributions that don't match the I/O distribution are really terrible, due to having more, smaller requests, and sometimes mismatching the stripe size (getting seg-like contention) or block size (reading partial blocks). They find it is better to read the file using the `best' distribution, and then reshuffle the data in memory. Big speedups.} } @TechReport{delrosario:vipfs-tr, author = {Juan Miguel {del Rosario} and Michael Harry and Alok Choudhary}, title = {The Design of {VIP-FS}: A Virtual, Parallel File System for High Performance Parallel and Distributed Computing}, year = {1994}, month = {May}, number = {SCCS-628}, institution = {NPAC}, address = {Syracuse, NY 13244}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/vipfs.ps.Z}, keywords = {parallel I/O, parallel file system, heterogeneous, pario-bib}, comment = {They are planning a parallel file system that is layered on top of standard workstation file systems, to be used by parallel applications on heterogeneous workstation clusters. All in user-level libraries, and on a per-application basis, application programs can distribute their data among many files on many machines. They plan to use a mapped interface like that of debenedictis:modular, and support efficient collective I/O in ways reminiscent of bennett:jovian and kotz:diskdir. Published as harry:vipfs.} } @InProceedings{demmel:eosdis, author = {James Demmel and Melody Y. Ivory and Sharon L. Smith}, title = {Modeling and Identifying Bottlenecks in {EOSDIS}}, booktitle = {Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation}, year = {1996}, month = {October}, pages = {300--308}, publisher = {IEEE Computer Society Press}, keywords = {climate modeling, performance modeling, parallel I/O, pario-bib}, abstract = {Many parallel application areas that exploit massive parallelism, such as climate modeling, require massive storage systems for the archival and retrieval of data sets. As such, advances in massively parallel computation must be coupled with advances in mass storage technology in order to satisfy I/O constraints of these applications. We demonstrate the effects of such I/O-computation disparity for a representative distributed information system, NASA's Earth Observing System Distributed Information System (EOSDIS). We use performance modeling to identify bottlenecks in EOSDIS for two representative user scenarios from climate change research.} } @TechReport{dewitt:gamma, author = {David J. {DeWitt} and Robert H. Gerber and Goetz Graefe and Michael L. Heytens and Krishna B. Kumar and M. Muralikrishna}, title = {{GAMMA}: A High Performance Dataflow Database Machine}, year = {1986}, month = {March}, number = {TR-635}, institution = {Dept. of Computer Science, Univ. of Wisconsin-Madison}, later = {dewitt:gamma2}, keywords = {parallel I/O, database, GAMMA, pario-bib}, comment = {Better to cite dewitt:gamma3. Multiprocessor (VAX) DBMS on a token ring with disk at each processor. They thought this was better than separating disks from processors by network, since then the network must handle {\em all} I/O rather than just what needs to move. Conjecture that shared memory might be the best interconnection network. Relations are horizontally partitioned in some way, and each processor reads its own set and operates on them there.} } @InProceedings{dewitt:gamma-dbm, author = {David J.
DeWitt and Shahram Ghandeharizadeh and Donovan Schneider}, title = {A Performance Analysis of the {GAMMA} Database Machine}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1988}, month = {June}, pages = {350--360}, publisher = {ACM Press}, address = {Chicago, IL}, later = {dewitt:gamma3}, keywords = {parallel I/O, database, performance analysis, Teradata, GAMMA, pario-bib}, comment = {Compared Gamma with Teradata. Various operations on big relations. They see fairly good linear speedup in many cases. They vary only one variable at a time. Their bottleneck was at the memory-network interface.} } @InProceedings{dewitt:gamma2, author = {David J. DeWitt and Robert H. Gerber and Goetz Graefe and Michael L. Heytens and Krishna B. Kumar and M. Muralikrishna}, title = {{GAMMA} --- {A} High Performance Dataflow Database Machine}, booktitle = {Proceedings of the 12th International Conference on Very Large Data Bases}, year = {1986}, pages = {228--237}, earlier = {dewitt:gamma}, later = {dewitt:gamma3}, keywords = {parallel I/O, database, GAMMA, pario-bib}, comment = {Almost identical to dewitt:gamma, with some updates. See that for comments, but cite this one. See also dewitt:gamma3 for a more recent paper.} } @Article{dewitt:gamma3, author = {David J. DeWitt and Shahram Ghandeharizadeh and Donovan A. Schneider and Allan Bricker and Hui-I Hsaio and Rick Rasmussen}, title = {The {Gamma} Database Machine Project}, journal = {IEEE Transactions on Knowledge and Data Engineering}, year = {1990}, month = {March}, volume = {2}, number = {1}, pages = {44--62}, publisher = {IEEE Computer Society Press}, earlier = {dewitt:gamma2}, keywords = {parallel I/O, database, GAMMA, pario-bib}, comment = {An updated version of dewitt:gamma2, with elements of dewitt:gamma-dbm. Really only need to cite this one. This is the same basic idea as dewitt:gamma2, but after they ported the system from the VAXen to an iPSC/2. Speedup results good. Question: how about comparing it to a single-processor, single-disk system with increasing disk bandwidth? That is, how much of their speedup comes from the increasing disk bandwidth, and how much from the actual use of parallelism?} } @Article{dewitt:pardbs, author = {David DeWitt and Jim Gray}, title = {Parallel Database Systems: The Future of High-Performance Database Systems}, journal = {Communications of the ACM}, year = {1992}, month = {June}, volume = {35}, number = {6}, pages = {85--98}, keywords = {database, parallel computing, parallel I/O, pario-bib}, comment = {They point out that the comments of boral:critique --- that database machines were doomed --- did not really come true. Their new thesis is that specialized hardware is not necessary and has not been successful, but that parallel database systems are clearly successful. In particular, they argue for shared-nothing layouts. They survey the state-of-the-art parallel DB systems. Earlier version in Computer Architecture News 12/90.} } @InProceedings{dewitt:parsort, author = {David J. DeWitt and Jeffrey F. Naughton and Donovan A. Schneider}, title = {Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting}, booktitle = {Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year = {1991}, month = {December}, pages = {280--291}, keywords = {parallel I/O, parallel database, external sorting, pario-bib}, comment = {Comparing exact and probabilistic splitting for external sorting on a database.
Model and experimental results from Gamma machine. Basically, the idea is to decide on a splitting vector, which defines $N$ buckets for an $N$-process program, and have each program read its initial segment of the data and send each element to the appropriate bucket (other process). All elements received are written to disks as small sorted runs. Then each process mergesorts its runs. Probabilistic split uses only a sample of the elements to define the vector.} } @InProceedings{dibble:bridge, author = {Peter Dibble and Michael Scott and Carla Ellis}, title = {Bridge: {A} High-Performance File System for Parallel Processors}, booktitle = {Proceedings of the Eighth International Conference on Distributed Computer Systems}, year = {1988}, month = {June}, pages = {154--161}, earlier = {ellis:interleaved}, later = {dibble:thesis}, URL = {ftp://ftp.cs.rochester.edu/pub/papers/systems/88.ICDCS.Bridge.ps.Z}, keywords = {Carla, Bridge, multiprocessor file system, Butterfly, parallel I/O, pario-bib}, comment = {See also dibble:*} } @Article{dibble:sort, author = {Peter C. Dibble and Michael L. Scott}, title = {External Sorting on a Parallel Interleaved File System}, journal = {University of Rochester 1989--90 Computer Science and Engineering Research Review}, year = {1989}, later = {dibble:sort2}, keywords = {parallel I/O, sorting, merging, parallel file reference pattern, pario-bib}, comment = {Based on Bridge file system (see dibble:bridge). Parallel external merge-sort tool. Sort file on each disk, then do a parallel merge. The merge is serialized by the token-passing mechanism, but the I/O time dominates. The key is to keep disks busy constantly. Uses some read-ahead, write-behind to control fluctuations in disk request timing. Analytical analysis of the algorithm lends insight and matches well with the timings. Locality is a big win in Bridge tools.} } @Article{dibble:sort2, author = {Peter C. Dibble and Michael L. Scott}, title = {Beyond Striping: The {Bridge} Multiprocessor File System}, journal = {Computer Architecture News}, year = {1989}, month = {September}, volume = {19}, number = {5}, earlier = {dibble:sort}, URL = {ftp://ftp.cs.rochester.edu/pub/papers/systems/89.CAN.Bridge.ps.Z}, keywords = {parallel I/O, external sorting, merging, parallel file reference pattern, pario-bib}, comment = {Subset of dibble:sort. Extra comments to distinguish from striping and RAID work. Good point that those projects are addressing a different bottleneck, and that they can provide essentially unlimited bandwidth to a single processor. Bridge could use those as individual file systems, parallelizing the overall file system, avoiding the software bottleneck. Using a very-reliable RAID at each node in Bridge could safeguard Bridge against failure for reasonable periods, removing reliability from Bridge level.} } @PhdThesis{dibble:thesis, author = {Peter C. Dibble}, title = {A Parallel Interleaved File System}, year = {1990}, month = {March}, school = {University of Rochester}, keywords = {parallel I/O, external sorting, merging, parallel file system, pario-bib}, comment = {Also TR 334. Mostly covered by other papers, but includes good introduction, discussion of reliability and maintenance issues, and implementation. Short mention of prefetching implied that simple OBL was counter-productive, but later tool-specific buffering with read-ahead was often important. The three interfaces to the PIFS server are interesting. 
A fourth compromise might help make tools easier to write.} } @Article{dickens:evaluation, author = {Phillip M. Dickens and Rajeev Thakur}, title = {Evaluation of Collective {I/O} Implementations on Parallel Architectures}, journal = {Journal of Parallel and Distributed Computing}, year = {2001}, month = {August}, volume = {61}, number = {8}, pages = {1052--1076}, publisher = {Academic Press}, copyright = {Academic Press}, URL = {http://www.idealibrary.com/links/doi/10.1006/jpdc.2000.1733}, keywords = {parallel I/O, collective I/O, pario-bib, parallel architecture}, abstract = {In this paper, we evaluate the impact on performance of various implementation techniques for collective I/O operations, and we do so across four important parallel architectures. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized implementation does provide excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all four computational platforms. Next, we evaluate implementation techniques for thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than just executing the collective I/O routine entirely in the foreground. To improve performance, we explore an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this implementation technique can provide significant performance gains, offering up to a 50\% improvement over implementations that do not attempt to overlap collective I/O and computation.} } @InProceedings{dickens:javaio, author = {Phillip M. Dickens and Rajeev Thakur}, title = {An Evaluation of {Java's I/O} Capabilities for High-Performance Computing}, booktitle = {Proceedings of the ACM 2000 Java Grande Conference}, year = {2000}, month = {June}, pages = {26--35}, publisher = {ACM Press}, URL = {http://www.mcs.anl.gov/~thakur/papers/javaio.ps}, keywords = {parallel I/O, Java, pario-bib}, abstract = {Java is quickly becoming the preferred language for writing distributed applications because of its inherent support for programming on distributed platforms. In particular, Java provides compile-time and run-time security, automatic garbage collection, inherent support for multithreading, support for persistent objects and object migration, and portability. Given these significant advantages of Java, there is a growing interest in using Java for high-performance computing applications. To be successful in the high-performance computing domain, however, Java must have the capability to efficiently handle the significant I/O requirements commonly found in high-performance computing applications. \par While there has been significant research in high-performance I/O using languages such as C, C++, and Fortran, there has been relatively little research into the I/O capabilities of Java. In this paper, we evaluate the I/O capabilities of Java for high-performance computing. We examine several approaches that attempt to provide high-performance I/O---many of which are not obvious at first glance---and investigate their performance in both parallel and multithreaded environments.
We also provide suggestions for expanding the I/O capabilities of Java to better support the needs of high-performance computing applications.} } @InProceedings{dickens:threads, author = {Phillip Dickens and Rajeev Thakur}, title = {Improving Collective {I/O} Performance Using Threads}, booktitle = {Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing}, year = {1999}, month = {April}, pages = {38--45}, URL = {http://www.mcs.anl.gov/~thakur/papers/ipps99-thread-coll.ps}, keywords = {parallel I/O, multithread programming, collective I/O, disk-directed I/O, two-phase I/O, pario-bib}, abstract = {Massively parallel computers are increasingly being used to solve large, I/O intensive applications in many different fields. For such applications, the I/O requirements quite often present a significant obstacle in the way of achieving good performance, and an important area of current research is the development of techniques by which these costs can be reduced. One such approach is {\it collective I/O}, where the processors cooperatively develop an I/O strategy that reduces the number, and increases the size, of I/O requests, making a much better use of the I/O subsystem. Collective I/O has been shown to significantly reduce the cost of performing I/O in many large, parallel applications, and for this reason serves as an important base upon which we can explore other mechanisms which can further reduce these costs. One promising approach is to use threads to perform the collective I/O {\it in the background} while the main thread continues with other computation in the foreground. \par In this paper, we explore the issues associated with implementing collective I/O in the background using threads. The most natural approach is to simply spawn off an I/O thread to perform the collective I/O in the background while the main thread continues with other computation. However, our research demonstrates that this approach is frequently the {\it worst} implementation option, often performing much more poorly than just executing collective I/O completely in the foreground. To improve the performance of thread-based collective I/O, we developed an alternate approach where {\it part} of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this new technique can significantly improve the performance of thread-based collective I/O, providing up to an 80\% improvement over sequential collective I/O (where there is no attempt to overlap computation with I/O). Also, we discuss one very important application of this research which is the implementation of the {\it split-collective} parallel I/O operations defined in MPI 2.0.}, comment = {They examine an implementation of collective I/O in MPI2 such that the collective I/O is done in the background, using a thread, while the computation continues. They found that the performance can be quite disappointing, because of the competition for the CPU between the computational thread and the background thread executing the redistribution phase of the I/O operation. They get better results by doing the redistribution in the foreground, making the computation wait, and then doing the I/O in the background thread while the computation continues. 
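(An aside from me, not from the paper: a minimal sketch in C with POSIX threads of the structure they found to work better. The job structure, the function names, and the use of pwrite on a raw file descriptor are my own illustrative assumptions; the paper works in the context of MPI collective I/O rather than a plain file write.
    /* Sketch only: do the communication-heavy redistribution in the
     * foreground, then hand just the file write to a background thread so it
     * overlaps the caller's computation. */
    #include <pthread.h>
    #include <unistd.h>

    struct write_job { int fd; const void *buf; size_t len; off_t off; };

    static void *background_write(void *arg)
    {
        struct write_job *job = arg;
        pwrite(job->fd, job->buf, job->len, job->off);  /* overlaps computation */
        return NULL;
    }

    void split_collective_write(struct write_job *job, pthread_t *tid,
                                void (*redistribute)(struct write_job *))
    {
        redistribute(job);   /* foreground: gather/reorganize data into job->buf */
        pthread_create(tid, NULL, background_write, job);
        /* caller computes, then calls pthread_join(*tid, NULL) before
         * touching job->buf again */
    }
As the comment above notes, which phase goes in the background matters more than the threading itself.)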
Results from four major parallel platforms, but only for write operations.} } @InProceedings{diegert:backprop, author = {Carl Diegert}, title = {Out-of-core Backpropagation}, booktitle = {International Joint Conference on Neural Networks}, year = {1990}, volume = {2}, pages = {97--103}, keywords = {parallel I/O, neural network, pario-bib}, comment = {An application that reads large files, sequentially, on CM2 with DataVault.} } @InProceedings{ding:oceanmodel, author = {Chris H.~Q. Ding and Yun He}, title = {Data Organization and {I/O} in a Parallel Ocean Circulation Model}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/ding.pdf}, keywords = {scientific application, parallel I/O, ocean modeling, climate modeling, pario-bib}, comment = {They describe the approaches taken to optimize an out-of-core parallel ocean model simulation on parallel distributed-memory machines. The original code used fixed-size memory windows to store the in-core portions of the dataset on the machine. The code used the same approach for machines that had enough memory to store the entire data set in-core, except rather than reading and writing to disk, the code copied to/from ramdisk (very copy intensive). The new code added an option to allow the entire dataset to be run in-core. Another place where the code could be optimized was in the writing of the dataset. For computational efficiency, the data was stored in memory as an array U(ix,iz,iy), but other applications needed the data stored on disk as U(ix,iy,iz). To optimize the I/O, the new code allocated additional processors to gather, reorganize, and write the data to disk (much like Salvo).} } @InProceedings{drapeau:raid-ii, author = {Ann L. Drapeau and Ken W. Shirrif and John H. Hartman and Ethan L. Miller and Srinivasan Seshan and Randy H. Katz and Ken Lutz and David A. Patterson and Edward K. Lee and Peter H. Chen and Garth A. Gibson}, title = {{RAID-II:} A High-Bandwidth Network File Server}, booktitle = {Proceedings of the 21st Annual International Symposium on Computer Architecture}, year = {1994}, pages = {234--244}, earlier = {chen:raid2}, URL = {http://portal.acm.org/citation.cfm?id=191995.192031}, keywords = {RAID, disk array, network file system, parallel I/O, pario-bib}, comment = {See also chen:raid2. The only significant addition in this paper is a discussion of the performance of the RAID-II running an LFS file system.} } @InProceedings{drapeau:tape-stripe, author = {Ann L. Drapeau and Randy H. Katz}, title = {Striping in Large Tape Libraries}, booktitle = {Proceedings of Supercomputing '93}, year = {1993}, pages = {378--387}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, keywords = {parallel I/O, pario-bib}, comment = {RAID-3 striping across drives in a tape robot, using 3 data plus one parity. Tape-switch time is very high, i.e., 4 minutes. Switching four tapes at the same time would only get a little overlap, because there is only one robot arm. Assume large request size. Striping is much faster when only one request is considered, but with many requests outstanding, response time gets much worse due to limited concurrency. More readers with the same stripe group size alleviate the contention and allow concurrency. Faster readers are the most important factor in improving performance, more important than improving robot speed.
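(My rough attempt at the `simple equation' wished for at the end of this comment, under the stated assumptions of one robot arm and little overlap while switching: reading $S$ bytes through a $k$-wide stripe costs roughly $k t_s + S/(kB)$, versus $t_s + S/B$ unstriped, where $t_s$ is the tape-switch time and $B$ a single reader's bandwidth. Comparing the two, striping wins exactly when $S/(kB) > t_s$, i.e., when the per-drive transfer time still exceeds a tape switch; as $B$ grows that condition fails sooner, which is one way to see why the benefit of striping diminishes as the drives get faster.)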
As both speeds improve the benefit of striping diminishes. Seems like this could be expressed in a simple equation...} } @Article{dunigan:hypercubes, author = {T. H. Dunigan}, title = {Performance of the {Intel iPSC/860} and {Ncube 6400} hypercubes}, journal = {Parallel Computing}, year = {1991}, volume = {17}, pages = {1285--1302}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {intel, ncube, hypercube, multiprocessor architecture, performance, parallel I/O, pario-bib}, comment = {An excellent paper presenting lots of detailed performance measurements on the iPSC/1, iPSC/2, iPSC/860, nCUBE 3200, and nCUBE 6400: arithmetic, FLOPS, communication, I/O. Tables of numbers provide details needed for simulation. iPSC/860 definitely is fastest, but way out of balance wrt communication vs. computation. Number of message hops is not so important in newer machines.} } @InProceedings{durand:coloring, author = {Dannie Durand and Ravi Jain and David Tseytlin}, title = {Applying Randomized Edge Coloring Algorithms to Distributed Communication: An Experimental Study}, booktitle = {Proceedings of the Seventh Symposium on Parallel Algorithms and Architectures}, year = {1995}, pages = {264--274}, keywords = {parallel I/O, scheduling, pario-bib}, comment = {They note that the set of data transfers in a parallel I/O architecture can be expressed as a graph coloring problem. Realistically, a centralized solution is not possible because the information is inherently distributed. So they develop some distributed algorithms and experimentally compare them to the centralized algorithm. They get within 5\% and do better than earlier algorithms.} } @Article{durand:edge-coloring, author = {Dannie Durand and Ravi Jain and David Tseytlin}, title = {Parallel I/O scheduling using randomized, distributed edge coloring algorithms.}, journal = {Journal of Parallel and Distributed Computing}, year = {2003}, month = {June}, volume = {63}, number = {6}, pages = {611--618}, institution = {Telcordia Technol, Appl Res, 445 South St, Morristown, NJ 07960 USA; Telcordia Technol, Appl Res, Morristown, NJ 07960 USA}, publisher = {Academic Press, 2003}, copyright = {(c)2004 IEE; Institute for Scientific Information, Inc.}, URL = {http://dx.doi.org/10.1016/S0743-7315(03)00015-7}, keywords = {randomized edge coloring, scheduling algorithms, bipartite graphs, parallel I/O, pario-bib}, abstract = {A growing imbalance in CPU (central processing unit) and I/O (input/output) speeds has led to a communications bottleneck in distributed architectures, especially for data intensive applications such as multimedia information systems, databases, and grand challenge problems. Our solution is to schedule parallel I/O operations explicitly. We present a class of decentralized scheduling algorithms that eliminate contention for I/O ports while maintaining an efficient use of bandwidth. These algorithms based on edge coloring and matching of bipartite graphs, rely upon simple heuristics to obtain shorter schedules. We use simulation to evaluate the ability of our algorithms to obtain near optimal solutions in a distributed context, and compare our work with that of other researchers. Our results show that our algorithms produce schedules within 5% of the optimal schedule, a substantial improvement over existing algorithms. 
} } @InProceedings{durand:scheduling, author = {Dannie Durand and Ravi Jain and David Tseytlin}, title = {Distributed Scheduling Algorithms to Improve the Performance of Parallel Data Transfers}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, month = {April}, pages = {85--104}, organization = {Bellcore}, note = {Also appeared in Computer Architecture News 22(4)}, later = {durand:scheduling-book}, keywords = {parallel I/O algorithms, pario-bib}, comment = {They devise some decentralized algorithms to generate schedules for data transfers between a set of clients and a set of servers when the complete set of transfers is known in advance, and the clients and servers are fairly tightly synchronized. They concentrate on the limitation that clients and servers may each only participate in one transfer at any given moment; interconnect bandwidth is not an issue. Their simulations show that their algorithms come within 20\% of optimal.} } @InCollection{durand:scheduling-book, author = {Dannie Durand and Ravi Jain and David Tseytlin}, title = {Improving the Performance of Parallel {I/O} Using Distributed Scheduling Algorithms}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {11}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {245--269}, publisher = {Kluwer Academic Publishers}, earlier = {durand:scheduling}, keywords = {parallel I/O, distributed scheduling algorithm, pario-bib}, abstract = {The cost of data transfers, and in particular of I/O operations, is a growing problem in parallel computing. This performance bottleneck is especially severe for data-intensive applications such as multimedia information systems, databases, and Grand Challenge problems. A promising approach to alleviating this bottleneck is to schedule parallel I/O operations explicitly. \par Although centralized algorithms for batch scheduling of parallel I/O operations have previously been developed, they may not be appropriate for all applications and architectures. We develop a class of decentralized algorithms for scheduling parallel I/O operations, where the objective is to reduce the time required to complete a given set of transfers. These algorithms, based on edge-coloring and matching of bipartite graphs, rely upon simple heuristics to obtain shorter schedules. We present simulation results indicating that the best of our algorithms can produce schedules whose length (or makespan) is within 2--20\% of the optimal schedule, a substantial improvement on previous decentralized algorithms. We discuss theoretical and experimental work in progress and possible extensions.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{duzett:ncube3, author = {Bob Duzett and Ron Buck}, title = {An Overview of the {nCUBE~3} Supercomputer}, booktitle = {Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation}, year = {1992}, pages = {458--464}, keywords = {parallel computer architecture, MIMD, pario-bib}, comment = {Basically the same architecture as the nCUBE/2, scaled up. Eight to 65K processors, each 50 MIPS and 100 DP MFLOPS, initially 50 MHz. RISC. 16 hypercube channels and 2 I/O channels per processor. CPU chip includes MMU, TLB, I- and D-cache, hypercube and I/O channels, and memory interface.
The channels have DMA support built-in (5 usec startup overhead, worst-case end-to-end latency 10 usec), and can talk directly to the memory interface or to the cache. 64-bit virtual address space, with 48 bits implemented. Hardware support for distributed virtual memory. Separate 16-node hypercube is used for I/O processing, with up to 400 disks attached. Packaging includes multi-chip module with DRAMs stacked directly on the CPU chip, fluid-cooled, so that an entire node is one package, with the 18 network links as essentially its only external connections.} } @TechReport{edelson:pario, author = {Daniel Edelson and Darrell D. E. Long}, title = {High Speed Disk {I/O} for Parallel Computers}, year = {1990}, month = {January}, number = {UCSC-CRL-90-02}, institution = {Baskin Center for Computer Engineering and Information Science}, keywords = {parallel I/O, disk caching, parallel file system, log-structured file system, Intel iPSC/2, pario-bib}, comment = {Essentially a small literature survey. No new ideas here, but it is a reasonable overview of the situation. Mentions caching, striping, disk layout optimization, log-structured file systems, and Bridge and Intel CFS. Plugs their ``Swift'' architecture (see cabrera:pario).} } @TechReport{el-ghazawi:mp1, author = {Tarek A. El-Ghazawi}, title = {{I/O} Performance of the {MasPar MP-1} Testbed}, year = {1994}, number = {TR-94--111}, institution = {CESDIS}, address = {NASA GSFC, Greenbelt, MD}, keywords = {parallel I/O, parallel architecture, performance evaluation, pario-bib}, comment = {See el-ghazawi:mpio.} } @InProceedings{el-ghazawi:mpio, author = {Tarek A. El-Ghazawi}, title = {Characteristics of the {MasPar} Parallel {I/O} System}, booktitle = {Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation}, year = {1995}, pages = {265--272}, keywords = {parallel I/O, parallel architecture, performance evaluation, pario-bib}, comment = {See el-ghazawi:mp1.} } @TechReport{elford:ppfs-detail, author = {Chris Elford and Chris Kuszmaul and Jay Huber and Tara Madhyastha}, title = {Portable Parallel File System Detailed Design}, year = {1993}, month = {November}, institution = {University of Illinois at Urbana-Champaign}, URL = {http://www-pablo.cs.uiuc.edu/Papers/PPFS-detail.ps.Z}, keywords = {parallel file system, parallel I/O, pario-bib}, comment = {See also elford:ppfs-tr, huber:ppfs.} } @TechReport{elford:ppfs-tr, author = {Chris Elford and Jay Huber and Chris Kuszmaul and Tara Madhyastha}, title = {{PPFS} High Level Design Documentation}, year = {1993}, month = {November}, institution = {University of Illinois at Urbana-Champaign}, URL = {http://www-pablo.cs.uiuc.edu/Papers/PPFS-high.ps.Z}, keywords = {parallel file system, parallel I/O, pario-bib}, comment = {See also elford:ppfs-detail, huber:ppfs-scenarios, huber:ppfs.} } @Article{elford:trends, author = {Chris L. Elford and Daniel A. Reed}, title = {Technology Trends and Disk Array Performance}, journal = {Journal of Parallel and Distributed Computing}, year = {1997}, month = {November}, volume = {46}, number = {2}, pages = {136--147}, keywords = {trends, disk technology, disk array, parallel I/O, pario-bib} } @TechReport{ellis:interleaved, author = {Carla Ellis and P. Dibble}, title = {An Interleaved File System for the {Butterfly}}, year = {1987}, month = {January}, number = {CS-1987-4}, institution = {Dept. 
of Computer Science, Duke University}, later = {dibble:bridge}, keywords = {Carla, multiprocessor file system, Bridge, Butterfly, parallel I/O, pario-bib} } @InProceedings{ellis:prefetch, author = {Carla Schlatter Ellis and David Kotz}, title = {Prefetching in File Systems for {MIMD} Multiprocessors}, booktitle = {Proceedings of the 1989 International Conference on Parallel Processing}, year = {1989}, month = {August}, pages = {I:306--314}, publisher = {Pennsylvania State Univ. Press}, copyright = {Pennsylvania State Univ. Press}, address = {St. Charles, IL}, earlier = {ellis:prefetchTR}, later = {kotz:prefetch}, keywords = {dfk, parallel file system, prefetching, disk caching, MIMD, parallel I/O, pario-bib}, abstract = {The problem of providing file I/O to parallel programs has been largely neglected in the development of multiprocessor systems. There are two essential elements of any file system design intended for a highly parallel environment: parallel I/O and effective caching schemes. This paper concentrates on the second aspect of file system design and specifically, on the question of whether prefetching blocks of the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions. \par Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that 1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, 2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O operation, and 3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). \par We explore why is it not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in this environment.} } @InProceedings{englert:nonstop, author = {Susanne Englert and Jim Gray and Terrye Kocher and Praful Shah}, title = {A Benchmark of {NonStop SQL Release 2} Demonstrating Near-linear Speedup and Scaleup on Large Databases}, booktitle = {Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1990}, month = {May}, pages = {245--246}, keywords = {parallel database, parallel architecture, parallel I/O, pario-bib}, abstract = {NonStop SQL is an implementation of ANSI/ISO SQL on Tandem Computer Systems. In its second release, NonStop SQL transparently and automatically implements parallelism within an SQL statement by exploiting Tandem's multiprocessor architecture. For basic queries on a uniform database, it achieves performance that is near-linear with respect to the number of processors and disks used. The authors describe benchmarks demonstrating these results and the technology used to achieve them.}, comment = {They (briefly) describe the Tandem NonStop system, including their disk nodes (which contain CPU, memory, and disk) and their use. 
A query involves sending a request to all the disk nodes, who independently read the appropriate data from their local disk, filter out all the interesting records, and send only those interesting records to the originator for processing. This is an early example of smart (programmable) I/O nodes.} } @TechReport{esser:paragon, author = {R\"udiger Esser and Renate Knecht}, title = {{Intel Paragon XP/S} --- Architecture and Software Environment}, year = {1993}, month = {April 26}, number = {KFA-ZAM-IB-9305}, institution = {Central Institute for Applied Mathematics, Research Center J\"ulich, Germany}, address = {\verb+r.esser@kfa-juelich.de+}, keywords = {multiprocessor architecture, pario-bib}, comment = {A nice summary of the Paragon architecture and OS. Some information that is not found in Intel's technical summary, and with much less marketing hype. But, it was written in April 1993 with a look to the future, so it may represent things that are not ready yet. Network interface allows user-mode msgs, DMA direct to user space if receive has been posted; else there is a new queue for every possible sending processor. They plan to expand the nodes to 4-processors and 64-128 MB. PFS stripes across RAIDs. Now SCSI-1 with 5 MB/s, later 10 MB/s SCSI-2, then 20 MB/s fast SCSI-2. See also intel:paragon.} } @TechReport{falkenberg:server, author = {Charles Falkenberg and Paul Hagger and Steve Kelley}, title = {A Server of Distributed Disk Pages Using a Configurable Software Bus}, year = {1993}, month = {July}, number = {CS-TR-3082}, institution = {Dept. of Computer Science, University of Maryland}, note = {Also cross-referenced as UMIACS-TR-93-47}, URL = {http://www.cs.umd.edu/TR/UMCP-CSD:CS-TR-3082}, keywords = {parallel I/O, network, virtual memory, parallel database, pario-bib}, abstract = {As network latency drops below disk latency, access time to a remote disk will begin to approach local disk access time. The performance of I/O may then be improved by spreading disk pages across several remote disk servers and accessing disk pages in parallel. To research this we have prototyped a data page server called a Page File. This persistent data type provides a set of methods to access disk pages stored on a cluster of remote machines acting as disk servers. The goal is to improve the throughput of database management system or other I/O intensive application by accessing pages from remote disks and incurring disk latency in parallel. This report describes the conceptual foundation and the methods of access for our prototype.}, comment = {An early document on a system under development. It declusters pages of a file across many page servers, and provides an abstraction of a linearly ordered collection of pages. The intended use is by database systems. As it stands now, there is little here other than block declustering, and thus, nothing new to the I/O community. Perhaps later they will develop interesting new caching or prefetching strategies.} } @InProceedings{fallah-adl:data, author = {Hassan Fallah-Adl and Joseph J\'aJ\'a and Shunlin Liang and Yoram J. 
Kaufman and John Townshend}, title = {Efficient Algorithms for Atmospheric Correction of Remotely Sensed Data}, booktitle = {Proceedings of Supercomputing '95}, year = {1995}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://www.supercomp.org/sc95/proceedings/511_HFAD/SC95.HTM}, keywords = {remote sensing, parallel I/O application, pario-bib}, abstract = {Remotely sensed imagery has been used for developing and validating various studies regarding land cover dynamics. However, the large amounts of imagery collected by the satellites are largely contaminated by the effects of atmospheric particles. The objective of atmospheric correction is to retrieve the surface reflectance from remotely sensed imagery by removing the atmospheric effects. We introduce a number of computational techniques that lead to a substantial speedup of an atmospheric correction algorithm based on using look-up tables. Excluding I/O time, the previously known implementation processes one pixel at a time and requires about 2.63 seconds per pixel on a SPARC-10 machine, while our implementation is based on processing the whole image and takes about 4--20 microseconds per pixel on the same machine. We also develop a parallel version of our algorithm that is scalable in terms of both computation and I/O. Experimental results obtained show that a Thematic Mapper (TM) image (36 MB per band, 5 bands need to be corrected) can be handled in less than 4.3 minutes on a 32-node CM-5 machine, including I/O time.}, comment = {Note proceedings only on CD-ROM or WWW.} } @InCollection{feitelson:bpario, author = {Dror G. Feitelson and Peter F. Corbett and Sandra Johnson Baylor and Yarsun Hsu}, title = {Parallel {I/O} Subsystems in Massively Parallel Supercomputers}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {25}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {389--407}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {feitelson:pario}, URL = {http://www.buyya.com/superstorage/}, keywords = {multiprocessor file system, parallel I/O, Vesta, pario-bib}, comment = {Part of jin:io-book. Excellent survey. Reformatted version of feitelson:pario.} } @Article{feitelson:pario, author = {Dror G. Feitelson and Peter F. Corbett and Sandra Johnson Baylor and Yarsun Hsu}, title = {Parallel {I/O} Subsystems in Massively Parallel Supercomputers}, journal = {IEEE Parallel and Distributed Technology}, year = {1995}, month = {Fall}, volume = {3}, number = {3}, pages = {33--47}, publisher = {IEEE Computer Society Press}, earlier = {feitelson:pario-tr}, later = {feitelson:bpario}, keywords = {multiprocessor file system, parallel I/O, Vesta, pario-bib}, abstract = {Applications executing on massively parallel supercomputers require a high aggregate bandwidth of I/O with low latency. This requirement cannot be satisfied by an external file server. One solution is to employ an internal parallel I/O subsystem, in which I/O nodes with DASD are linked to the same interconnection network that connects the compute nodes. The option of increasing the number of I/O nodes together with the number of compute nodes allows for a balanced architecture. Indeed, most multicomputer vendors provide internal parallel I/O subsystems as part of their product offerings. However, these systems typically attempt to preserve a Unix-compatible interface, hiding or abstracting the parallelism.
New interfaces may be required to fully utilize the capabilities of parallel I/O.}, comment = {A very nice survey of multiprocessor file system issues. Published version of feitelson:pario-tr.} } @TechReport{feitelson:pario-tr, author = {Dror G. Feitelson and Peter F. Corbett and Sandra Johnson Baylor and Yarsun Hsu}, title = {Satisfying the {I/O} Requirements of Massively Parallel Supercomputers}, year = {1993}, month = {July}, number = {Research Report RC 19008 (83016)}, institution = {IBM T. J. Watson Research Center}, later = {feitelson:pario}, keywords = {multiprocessor file system, parallel I/O, Vesta, pario-bib, OS94W}, comment = {A very nice survey of multiprocessor file system issues. They make a good point that I/O {\em needs\/} would increase if I/O capabilities increase, because people would output more iterations, more complete data sets, etc.\ They make the case for internal file systems, the use of dedicated I/O nodes, the attachment of every RAID to two I/O nodes for reliability, the Vesta interface, and user control over the view of a parallel file. See also corbett:vesta*. Published as feitelson:pario.} } @InProceedings{feitelson:terminal, author = {Dror G. Feitelson}, title = {Terminal {I/O} for Massively Parallel Systems}, booktitle = {Proceedings of the Scalable High-Performance Computing Conference}, year = {1994}, pages = {263--270}, keywords = {parallel I/O, pario-bib}, comment = {How to deal with stdin/stdout on a parallel processor. Basically, each task is given its own window, where the user can see the output and type input to that task. Then, they have a window for LEDs, i.e., little squares, one for each task. The square changes color depending on the situation. The default is to turn green when output is available, red when waiting for input, and white when the window is currently open. Clicking on these opens the appropriate window, so there is some control over which windows you are watching. They also provide a programmer interface to allow the programmer to control the LED color.} } @InProceedings{feitelson:vesta-perf, author = {Dror G. Feitelson and Peter F. Corbett and Jean-Pierre Prost}, title = {Performance of the {Vesta} Parallel File System}, booktitle = {Proceedings of the Ninth International Parallel Processing Symposium}, year = {1995}, month = {April}, pages = {150--158}, earlier = {feitelson:vesta-perf-tr}, keywords = {parallel I/O, multiprocessor file system, Vesta, pario-bib}, comment = {See feitelson:vesta-perf-tr.} } @TechReport{feitelson:vesta-perf-tr, author = {Dror G. Feitelson and Peter F. Corbett and Jean-Pierre Prost}, title = {Performance of the {Vesta} Parallel File System}, year = {1994}, month = {September}, number = {RC~19760 (87534)}, institution = {IBM T.J. Watson Research Center}, address = {Yorktown Heights, NY 10598}, later = {feitelson:vesta-perf}, URL = {http://www.watson.ibm.com:8080/PS/157.ps.gz}, keywords = {parallel I/O, multiprocessor file system, Vesta, pario-bib}, comment = {Cite feitelson:vesta-perf. A good performance study of Vesta running on an SP-1. See corbett:jvesta for ultimate reference. In all, Vesta performed very well for both single-node and multiple-node performance. I wish that they had tried some very small BSUs; at one point they tried 16-byte BSUs and the performance looked very poor. Section on I/O vectors is confusing.} } @InCollection{feitelson:xml, author = {Dror G.
Feitelson and Tomer Klainer}, title = {{XML}, Hyper-media, and {Fortran I/O}}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {43}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {633--644}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, parallel I/O interface, pario-bib}, comment = {Part of jin:io-book.} } @Article{feng:io-response, author = {Dan Feng and Hong Jiang and Yifeng Zhu}, title = {{I/O} response time in a fault-tolerant parallel virtual file system}, journal = {Lecture Notes in Computer Science}, booktitle = {IFIP International Conference on Network and Parallel Computing; October 18-20, 2004; Wuhan, PEOPLES R CHINA}, editor = {Jin, H; Gao, GR; Xu, ZW; Chen, H}, year = {2004}, month = {October}, volume = {3222}, pages = {248--251}, institution = {Huazhong Univ Sci \& Technol, Coll Comp, Minist Educ, Key Lab Data Storage Syst, Wuhan 430074, Peoples R China; Univ Nebraska, Dept Comp Sci \& Engn, Lincoln, NE 68588 USA}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2004 The Thomson Corporation}, URL = {http://springerlink.metapress.com/link.asp?id=484ru5hgyxegr5r2}, keywords = {fault-tolerance, PVFS, performance analysis, pario-bib}, abstract = {A fault tolerant parallel virtual file system is designed and implemented to provide high I/O performance and high reliability. A queuing model is used to analyze in detail the average response time when multiple clients access the system. The results show that I/O response time is a function of several operational parameters. It decreases with the increase in I/O buffer hit rate for read requests, write buffer size for write requests and number of server nodes in the parallel file system, while a higher I/O request arrival rate increases I/O response time.} } @Article{feng:performance, author = {Dan Feng and Hong Jiang and Yi-Feng Zhu}, title = {{I/O} performance of an {RAID-10} style parallel file system}, journal = {Journal of Computer Science and Technology}, year = {2004}, month = {November}, volume = {19}, number = {6}, pages = {965--972}, institution = {Huazhong Univ Sci \& Technol, Dept Comp Sci \& Engn, Natl Storage Syst Lab, Wuhan 430074, Peoples R China; Huazhong Univ Sci \& Technol, Dept Comp Sci \& Engn, Natl Storage Syst Lab, Wuhan 430074, Peoples R China; Univ Nebraska, Dept Comp Sci \& Engn, Lincoln, NE USA}, publisher = {SCIENCE CHINA PRESS}, copyright = {(c)2005 The Thomson Corporation}, URL = {http://jcst.ict.ac.cn/cone/cone46.html#paper29}, keywords = {PVFS, parallel I/O, I/O response time, pario-bib}, abstract = {Without any additional cost, all the disks on the nodes of a cluster can be connected together through CEFT-PVFS, a RAID-10 style parallel file system, to provide multi-GB/s parallel I/O performance. I/O response time is one of the most important measures of quality of service for a client. When multiple clients submit data-intensive jobs at the same time, the response time experienced by the user is an indicator of the power of the cluster. In this paper, a queuing model is used to analyze in detail the average response time when multiple clients access CEFT-PVFS. The results reveal that response time is a function of several operational parameters.
The results show that I/O response time decreases with the increases in I/O buffer hit rate for read requests, write buffer size for write requests and the number of server nodes in the parallel file system, while the higher the I/O requests arrival rate, the longer the I/O response time. On the other hand, the collective power of a large cluster supported by CEFT-PVFS is shown to be able to sustain a steady and stable I/O response time for a relatively large range of the request arrival rate.} } @InProceedings{ferragina:soda96, author = {Paolo Ferragina and Roberto Grossi}, title = {Fast String Searching in Secondary Storage: Theoretical Developments and Experimental Results}, booktitle = {Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA `96)}, year = {1996}, month = {June}, pages = {373--382}, publisher = {ACM Press}, address = {Atlanta}, URL = {http://www.di.unipi.it/~ferragin/Latex/jsoda96.ps.gz}, keywords = {out-of-core algorithm, parallel I/O, pario-bib}, abstract = {In a previous work [Ferragina-Grossi, ACM STOC 95], we proposed a text indexing data structure for secondary storage, which we called SB-tree, that combines the best of B-trees and suffix arrays, overcoming the limitations of inverted files, suffix arrays, suffix trees, and prefix B-trees. In this paper we study the performance of SB-trees in a practical setting, performing a set of searching and updating experiments. Improved performance was obtained by a new space efficient and alphabet-independent organization of the internal nodes of the SB-tree, and a new batch insertion procedure that avoids thrashing.} } @InProceedings{ferragina:stoc95, author = {Paolo Ferragina and Roberto Grossi}, title = {A Fully-dynamic data structure for external substring search}, booktitle = {Proceedings of the 27th Annual ACM Symposium on Theory of Computing}, year = {1995}, pages = {693--702}, publisher = {ACM Press}, address = {Las Vegas}, URL = {http://www.acm.org/pubs/articles/proceedings/stoc/225058/p693-ferragina/p693-ferragina.pdf}, keywords = {out-of-core algorithm, parallel I/O, pario-bib} } @Article{ferreira:data-intensive, author = {Renato Ferreira and Gagan Agrawal and Joel Saltz}, title = {Data parallel language and compiler support for data intensive applications}, journal = {Parallel Computing}, year = {2002}, month = {May}, volume = {28}, number = {5}, pages = {725--748}, publisher = {Elsevier Science}, URL = {http://www.elsevier.com/gej-ng/10/35/21/60/57/30/abstract.html}, keywords = {parallel I/O, parallel applications, data parallel, pario-bib}, abstract = {Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. High-level language and compiler support for developing applications that analyze and process such datasets has, however, been lacking so far. \par In this paper, we present a set of language extensions and a prototype compiler for supporting high-level object-oriented programming of data intensive reduction operations over multidimensional data. We have chosen a dialect of Java with data-parallel extensions for specifying a collection of objects, a parallel for loop, and reduction variables as our source high-level language. Our compiler analyzes parallel loops and optimizes the processing of datasets through the use of an existing run-time system, called active data repository (ADR). 
We show how loop fission followed by interprocedural static program slicing can be used by the compiler to extract required information for the run-time system. We present the design of a compiler/run-time interface which allows the compiler to effectively utilize the existing run-time system. \par A prototype compiler incorporating these techniques has been developed using the Titanium front-end from Berkeley. We have evaluated this compiler by comparing the performance of compiler generated code with hand customized ADR code for three templates, from the areas of digital microscopy and scientific simulations. Our experimental results show that the performance of compiler generated versions is, on the average, 21% lower, and in all cases within a factor of two, of the performance of hand coded versions.} } @InProceedings{ferreira:microscope, author = {Renato Ferreira and Bongki Moon and Jim Humphries and Alan Sussman and Joel Saltz and Robert Miller and Angelo Demarzo}, title = {The Virtual Microscope}, booktitle = {American Medical Informatics Association, 1997 Annual Fall Symposium}, year = {1997}, month = {October}, pages = {449--453}, address = {Nashville, TN}, URL = {http://www.cs.arizona.edu/~bkmoon/papers/amia97.ps}, keywords = {pario-bib, application}, comment = {Best Application Paper award. \par This paper describes a client/server application that emulates a high power light microscope. They use wavelet compression to reduce the size of each of the electronic slides and they use a parallel data server much like the ones used for satellite image data (see chang:titan) to service data requests.} } @InProceedings{fineberg:nht1, author = {Samuel A. Fineberg}, title = {Implementing the {NHT-1} application {I/O} benchmark}, booktitle = {Proceedings of the IPPS~'93 Workshop on Input/Output in Parallel Computer Systems}, year = {1993}, pages = {37--55}, address = {Newport Beach, CA}, note = {Also published in Computer Architecture News 21(5), December 1993, pages 23--30}, keywords = {parallel I/O, multiprocessor file system, benchmark, pario-bib}, comment = {See also carter:benchmark. Some preliminary results from one of their benchmarks. Note: ``I was only using a single Cray disk with a maximum transfer rate of 9.6MBytes/sec.'' --- Fineberg.} } @InProceedings{fineberg:pmpio, author = {Samuel A. Fineberg and Parkson Wong and Bill Nitzberg and Chris Kuszmaul}, title = {{PMPIO}--- A Portable Implementation of {MPI-IO}}, booktitle = {Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation}, year = {1996}, month = {October}, pages = {188--195}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, pario-bib}, abstract = {MPI-IO provides a demonstrably efficient portable parallel Input/Output interface, compatible with the MPI standard. PMPIO is a "reference implementation" of MPI-IO, developed at NASA Ames Research Center. To date, PMPIO has been ported to the IBM SP-2, SGI and Sun shared memory workstations, the Intel Paragon, and the Cray J90. Preliminary results using the PMPIO implementation of MPI-IO show an improvement of as much as a factor of 20 on the NAS BTIO benchmark compared to a Fortran based implementation. We show comparative results on the SP-2, Paragon, and SGI architectures.} } @InProceedings{flynn:hyper-fs, author = {Robert J.
Flynn and Haldun Hadimioglu}, title = {A Distributed Hypercube File System}, booktitle = {Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications}, year = {1988}, pages = {1375--1381}, publisher = {ACM Press}, address = {Pasadena, CA}, keywords = {parallel I/O, hypercube, parallel file system, pario-bib}, comment = {For hypercube-like architectures. Interleaved files, though flexible. Separate network for I/O, maybe not hypercube. I/O is blocked and buffered -- no coherency or prefetching issues discussed. Buffered close to point of use. Parallel access is ok. Broadcast supported? I/O nodes distinguished from comp nodes. I/O hooked to front-end too. See hadimioglu:fs and hadimioglu:hyperfs} } @Article{ford:rail, author = {Daniel A. Ford and Robert J.~T. Morris and Alan E. Bell}, title = {Redundant arrays of independent libraries ({RAIL}): the {StarFish} tertiary storage system}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {45--64}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00116-6}, keywords = {parallel I/O, redundant data, striping, tertiary storage, pario-bib}, abstract = {Increased computer networking has sparked a resurgence of the `on-line' revolution of the 1970's, making ever larger amounts of data available on a world wide basis and placing greater demands on the performance and availability of tertiary storage systems. In this paper, we argue for a new approach to tertiary storage system architecture that is obtained by coupling multiple small and inexpensive `building block' libraries (or jukeboxes) together to create larger tertiary storage systems. We call the resulting system a RAIL and show that it has performance and availability characteristics superior to conventional tertiary storage systems, for almost the same dollar/megabyte cost. A RAIL system is the tertiary storage equivalent of a fixed magnetic disk RAID storage system, but with several additional features that enable the ideas of data striping and redundancy to function efficiently on dismountable media and robotic media mounting systems. We present the architecture of such a system called Starfish I and describe the implementation of a prototype. 
We also introduce the idea of creating a log-structured library array (LSLA) on top of a RAIL architecture (StarFish II) and show how it can have write performance equivalent to that of secondary storage, and improved read performance along with other advantages such as easier compression and the elimination of the 4*RAID/RAIL write penalty.}, comment = {Part of a special issue.} } @InCollection{foster:arrays, author = {Ian Foster and Jarek Nieplocha}, title = {Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computations}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {33}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {488--498}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {nieplocha:arrays}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of nieplocha:arrays.} } @Misc{foster:chemio, author = {Ian Foster and Jarek Nieplocha}, title = {{ChemIO}: High-Performance {I/O} for Computational Chemistry Applications}, year = {1996}, month = {February}, howpublished = {WWW \mbox{http://www.mcs.anl.gov/chemio/}}, later = {nieplocha:chemio}, URL = {http://www.mcs.anl.gov/chemio/}, keywords = {computational science, chemistry, parallel I/O, pario-bib}, comment = {A library package for computational chemistry programs. It supports out-of-core arrays. See also nieplocha:chemio.} } @TechReport{foster:climate, author = {Ian Foster and Mark Henderson and Rick Stevens}, title = {Data Systems for Parallel Climate Models}, year = {1991}, month = {July}, number = {ANL/MCS-TM-169}, institution = {Argonne National Laboratory}, note = {Copies of slides from a workshop by this title, with these organizers}, keywords = {parallel I/O, parallel database, multiprocessor file system, climate model, grand challenge, tertiary storage, archival storage, RAID, tape robot, pario-bib}, comment = {Includes the slides from many presenters covering climate modeling, data requirements for climate models, archival storage systems, multiprocessor file systems, and so forth. NCAR data storage growth rates (p.~54), 500 bytes per MFlop, or about 8~TB/year with Y/MP-8. Average file length 26.2~MB. Migration across both storage hierarchy and generations of media. LLNL researcher: typical 50-year, 3-dimensional model with 5-degree resolution will produce 75~GB of output. Attendee list included.} } @InProceedings{foster:remote-io, author = {Ian Foster and David {Kohr, Jr.} and Rakesh Krishnaiyer and Jace Mogill}, title = {Remote {I/O}: Fast Access to Distant Storage}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {14--25}, publisher = {ACM Press}, address = {San Jose, CA}, URL = {http://doi.acm.org/10.1145/266220.266222}, keywords = {parallel I/O, distributed file system, pario-bib}, abstract = {As high-speed networks make it easier to use distributed resources, it becomes increasingly common that applications and their data are not colocated. Users have traditionally addressed this problem by manually staging data to and from remote computers. We argue instead for a remote I/O paradigm in which programs use familiar parallel I/O interfaces to access remote filesystems. 
In addition to simplifying remote execution, remote I/O can improve performance relative to staging by overlapping computation and data transfer or by reducing communication requirements. However, remote I/O also introduces new technical challenges in the areas of portability, performance, and integration with distributed computing systems. We propose techniques designed to address these challenges and describe a remote I/O library called RIO that we are developing to evaluate the effectiveness of these techniques. RIO addresses issues of portability by adopting the quasi-standard MPI-IO interface and by defining a RIO device and RIO server within the ADIO abstract I/O device architecture. It addresses performance issues by providing traditional I/O optimizations such as asynchronous operations and through implementation techniques such as buffering and message forwarding to offload communication overheads. Microbenchmarks and application experiments demonstrate that our techniques can improve turnaround time relative to staging.}, comment = {They want to support users that have datasets at different locations in the Internet, but need to access the data at supercomputer parallel machines. Rather than staging data in and out, they want to provide remote access. Issues: naming, dynamic loads, heterogeneity, security, fault-tolerance. All traffic goes through a 'forwarder node' that funnels all the traffic into the network. They use URLs for pathnames (e.g., "x-rio://..."). They find that non-blocking ops are important, as is collective I/O. They think that buffering will be important. Limited experiments.} } @Book{fox:cubes, author = {G. Fox and M. Johnson and G. Lyzenga and S. Otto and J. Salmon and D. Walker}, title = {Solving Problems on Concurrent Processors}, year = {1988}, volume = {1}, publisher = {Prentice Hall}, address = {Englewood Cliffs, NJ}, keywords = {hypercube, pario-bib}, comment = {See fox:cubix for parallel I/O.} } @InBook{fox:cubix, author = {G. Fox and M. Johnson and G. Lyzenga and S. Otto and J. Salmon and D. Walker}, title = {Solving Problems on Concurrent Processors}, chapter = {6 and 15}, crossref = {fox:cubes}, year = {1988}, volume = {1}, publisher = {Prentice Hall}, address = {Englewood Cliffs, NJ}, keywords = {parallel file system, hypercube, pario-bib}, comment = {Parallel I/O control, called CUBIX. Interesting method. Depends a lot on ``loose synchronization'', which is sort of SIMD-like.} } @InProceedings{franke:filters, author = {Ernest Franke and Michael Magee}, title = {Reducing Data Distribution Bottlenecks by Employing Data Visualization Filters}, booktitle = {Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing}, year = {1999}, month = {August}, pages = {255--262}, publisher = {IEEE Computer Society Press}, address = {Redondo Beach, CA}, URL = {http://computer.org/conferen/proceed/hpdc/0287/02870041abs.htm}, keywords = {distributed computing, filters, grid, input/output, parallel I/O, pario-bib, app-pario}, abstract = {Between 1994 and 1997, researchers at Southwest Research Institute (SwRI) investigated methods for distributing parallel computation and data visualization under the support of an internally funded Research Initiative Program entitled the Advanced Visualization Technology Project (AVTP). A hierarchical data cache architecture was developed to provide a flexible interface between the modeling or simulation computational processes and data visualization programs.
Compared to conventional post facto data visualization approaches, this data cache structure provides many advantages including simultaneous data access by multiple visualization clients, comparison of experimental and simulated data, and visual analysis of computer simulation as computation proceeds. \par However, since the data cache was resident on a single workstation, this approach did not address the issue of scalability of methods for avoiding the data storage bottleneck by distributing the data across multiple networked workstations. Scalability through distributed database approaches is being investigated as part of the Applied Visualization using Advanced Network Technology Infrastructure (AVANTI) project.\par This paper describes a methodology currently under development that is intended to avoid bottlenecks that typically arise as the result of data consumers (e.g. visualization applications) that must access and process large amounts of data that has been generated and resides on other hosts, and which must pass through a central data cache prior to being used by the data consumer. The methodology is based on a fundamental paradigm that the end result (visualization) rendered by a data consumer can, in many cases, be produced using a reduced data set that has been distilled or filtered from the original data set. \par In the most basic case, the filtered data used as input to the data consumer may simply be a proper subset of massive data sets that have been distributed among hosts. For the general case, however, the filtered data may bear no resemblance to the original data since it is the result of processing the raw data set and distilling it to its visual "essence", i.e. the minimal data set that is absolutely required by the data consumer in order to perform the required rendering function. Data distribution bottlenecks for visualization applications are thereby reduced by avoiding the transfer of large amounts of raw data in favor of considerably distilled visual data.\par There are, of course, computational costs associated with this approach since raw data must be processed into its visual essence, but these computational costs may be distributed among multiple processors. It should be realized, however, that, in general, these computational costs would exist anyway since, for the visualization to be performed, there must be a transformation between the raw data and the visualization primitives (e.g. line segments, polygon vertices, etc.) to be rendered. The main principle put forth by this paper is that if data distribution bottlenecks are to be minimized, the amount of raw data transferred should be reduced by employing data filtering processes that can be distributed among multiple hosts. \par The complete paper demonstrates, both analytically and experimentally, that this approach becomes increasingly effective (scalable) as the computational expense associated with the data filtering transformation rises.}, comment = {The goal of their work is to improve the performance of data visualization applications that use remote data generators (disk or running application) and data consumers (the visualization station). They deal with network bottlenecks by using a distributed-redundant data cache to hold intermediate data between the data generator and the data consumer. They also reduce network traffic by applying data filters to the data at the distributed cache processors.
The main argument is that since the data must be filtered before it is visualized, it makes more sense to perform the filtering at the data cache so the computation can be distributed and to reduce the amount of data that needs to be transferred to the data consumer.} } @Article{freedman:spiffi, author = {Craig S. Freedman and Josef Burger and David J. DeWitt}, title = {{SPIFFI} --- A Scalable Parallel File System for the {Intel Paragon}}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1996}, month = {November}, volume = {7}, number = {11}, pages = {1185--1200}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/tpds/td1996/l1185abs.htm}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {The complete paper on the SPIFFI parallel file system. It seems to be much like Intel CFS from the programmer's point of view, with a few new file modes, user-selectable striping granularity. Their Paragon, though a source of problems, had a disk on every node (though they do not take advantage of that in this work). They have a buffer pool on each I/O node, which does prefetching in a somewhat novel way.} } @InProceedings{freedman:video, author = {Craig S. Freedman and David J. DeWitt}, title = {The {SPIFFI} Scalable Video-on-Demand System}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1995}, pages = {352--363}, publisher = {ACM Press}, keywords = {parallel file system, multimedia, video server, pario-bib}, comment = {See also freedman:spiffi. They simulate their video-on-demand server. Their model is a cluster of workstation servers, connected by a network to video-display terminals. The terminals just have a circular buffer queue that they fill by making requests to the server, and drain by uncompressing MPEG and displaying video. The servers manage a buffer pool and a set of striped disks. All videos are striped across all disks. They use dual LRU lists in the server buffer pool: one for used blocks, and one for prefetched blocks (``love prefetching''). They use a ``real-time'' disk scheduling algorithm that prioritizes requests by their deadlines (or anticipated deadline in case of a prefetch). Their metric is maximum number of terminals that can be supported without glitches. They plan to implement their system on a workstation cluster.} } @InProceedings{freitag:visualization, author = {Lori A. Freitag and Raymond M. Loy}, title = {Adaptive, Multiresolution Visualization of Large Data Sets using a Distributed Memory Octree}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/loy.pdf}, keywords = {interactive visualization, multi-resolution visualization, adaptive visualization, scientific application, parallel octrees, pario-bib}, comment = {They describe a technique that combines hierarchical data reduction methods with parallel computing to allow "interactive exploration of large data sets while retaining full-resolution capabilities." They point out that visualization of large data sets requires a post-processing step to reduce the size, or sophisticated rendering algorithms that work with the full resolution. Their method combines the two techniques.} } @InProceedings{french:balance, author = {James C.
French}, title = {Characterizing the Balance of Parallel {I/O} Systems}, booktitle = {Proceedings of the Sixth Annual Distributed-Memory Computer Conference}, year = {1991}, pages = {724--727}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {Proposes the min\_SAR, max\_SAR, and ratio phi as measures of aggregate file system bandwidth. Has to do with load balance issues; how well the file system balances between competing nodes in a heavy-use period.} } @InProceedings{french:ipsc2io, author = {James C. French and Terrence W. Pratt and Mriganka Das}, title = {Performance Measurement of a Parallel Input/Output System for the {Intel iPSC/2} Hypercube}, booktitle = {Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1991}, pages = {178--187}, earlier = {french:ipsc2io-tr}, later = {french:ipsc2io-jpdc}, keywords = {parallel I/O, Intel iPSC/2, pario-bib} } @Article{french:ipsc2io-jpdc, author = {James C. French and Terrence W. Pratt and Mriganka Das}, title = {Performance Measurement of the {Concurrent File System} of the {Intel iPSC/2} Hypercube}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January and February}, volume = {17}, number = {1--2}, pages = {115--121}, publisher = {Academic Press}, earlier = {french:ipsc2io}, keywords = {parallel I/O, Intel iPSC/2, pario-bib} } @TechReport{french:ipsc2io-tr, author = {James C. French and Terrence W. Pratt and Mriganka Das}, title = {Performance Measurement of a Parallel Input/Output System for the {Intel iPSC/2} Hypercube}, year = {1991}, number = {IPC-TR-91-002}, institution = {Institute for Parallel Computation, University of Virginia}, later = {french:ipsc2io}, URL = {ftp://ftp.cs.virginia.edu/pub/techreports/IPC-91-02.ps.Z}, keywords = {parallel I/O, Intel iPSC/2, disk caching, prefetching, pario-bib}, comment = {Nice study of performance of existing CFS system on 32-node + 4 I/O-node iPSC/2. They show big improvements due to declustering, preallocation, caching, and prefetching. See also pratt:twofs.} } @InProceedings{galbreath:applio, author = {N. Galbreath and W. Gropp and D. Levine}, title = {Applications-Driven Parallel {I/O}}, booktitle = {Proceedings of Supercomputing '93}, year = {1993}, pages = {462--471}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, later = {galbreath:bapplio}, keywords = {parallel I/O, pario-bib}, comment = {They give a useful overview of the I/O requirements of many applications codes, in terms of input, output, scratch files, debugging, and checkpointing. They also describe their architecture-independent I/O interface that provides calls to read and write entire arrays, with some flexibility in the format and distribution of the array. Curious centralized control method. Limited performance evaluation. They're trying to keep the I/O media, file layout, and I/O architecture transparent to the user. Implementation decides which processors actually do read/write. Data formatted or unformatted; file sequential or parallel; can specify distributed arrays with ghost points. Runs on lots of platforms; will also be implementing on IBM SP-1 with disk per node, 128 nodes. Their package is freely available via ftp. Future: buffer-size experiments, unstructured data, use parallel file internally and then sequentialize on close.} } @InCollection{galbreath:bapplio, author = {Nicholas P. Galbreath and William D. Gropp and David M.
Levine}, title = {Applications-Driven Parallel {I/O}}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {36}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {539--547}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {galbreath:applio}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of galbreath:applio.} } @Article{ganger:diskarray, author = {Gregory R. Ganger and Bruce L. Worthington and Robert Y. Hou and Yale N. Patt}, title = {Disk Arrays: High Performance, High-Reliability Storage Subsystems}, journal = {IEEE Computer}, year = {1994}, month = {March}, volume = {27}, number = {3}, pages = {30--36}, publisher = {IEEE Computer Society Press}, keywords = {disk array, RAID, parallel I/O, pario-bib, survey} } @InProceedings{ganger:load-balance, author = {Gregory R. Ganger and Bruce L. Worthington and Robert Y. Hou and Yale N. Patt}, title = {Disk Subsystem Load Balancing: Disk Striping vs. Conventional Data Placement}, booktitle = {Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences}, year = {1993}, volume = {I}, pages = {40--49}, keywords = {parallel I/O, disk striping, load balancing, pario-bib}, comment = {Using trace-driven simulation to compare dynamic load-balancing techniques in databases that span several disk drives, with the inherent load-balancing of striping. Their traces were from two Oracle databases on two different NCR systems. They found that striping, with its essentially random block-by-block load balancing, does a better job of avoiding short-term load imbalances than the ``manual'' load-balancing does.} } @Article{garcia:expand-design, author = {F\'elix Garcia-Carballeira and Alejandro Calderon and Jesus Carretero and Javier Fernandez and Jose M. Perez}, title = {The Design of the {Expand} Parallel File System}, journal = {The International Journal of High Performance Computing Applications}, year = {2003}, volume = {17}, number = {1}, pages = {21--38}, publisher = {Sage Science Press}, URL = {http://www.sagepub.co.uk/JournalIssueAbstract.aspx?pid=105593&jiid=515697&jiaid=33121}, keywords = {parallel file system, parallel I/O, pario-bib}, abstract = {This article describes an implementation of MPI-IO using a new parallel file system, called Expand (Expandable Parallel File System), which is based on NFS servers. Expand combines multiple NFS servers to create a distributed partition where files are striped. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and NFS protocols. Using this system, we can join heterogeneous servers (Linux, Solaris, Windows 2000, etc.) to provide a parallel and distributed partition. The article describes the design, implementation and evaluation of Expand with MPI-IO. 
This evaluation has been made in Linux clusters and compares Expand and PVFS.} } @Article{garcia:striping-reliability, author = {Hector Garcia-Molina and Kenneth Salem}, title = {The Impact of Disk Striping on Reliability}, journal = {IEEE Database Engineering Bulletin}, year = {1988}, month = {March}, volume = {11}, number = {1}, pages = {26--39}, keywords = {parallel I/O, disk striping, reliability, disk array, pario-bib}, comment = {Reliability of striped filesystems may not be as bad as you think. Parity disks help. Performance improvements limited to small number of disks ($n<10$). Good point: efficiency of striping will increase as the gap between CPU/memory performance and disk speed and file size widens. Reliability may be better if measured in terms of performing a task in time T, since the striped version may take less time. This gives disks less opportunity to fail during that period. Also consider the CPU failure mode, and its use over less time.} } @InProceedings{garg:tflops-pfs, author = {Sharad Garg}, title = {{TFLOPS PFS}: Architecture and Design of A Highly Efficient Parallel File System}, booktitle = {Proceedings of SC98: High Performance Networking and Computing}, year = {1998}, month = {November}, publisher = {ACM Press}, URL = {http://www.supercomp.org/sc98/TechPapers/sc98_FullAbstracts/Garg891/}, keywords = {parallel file system, intel, ASCI Red, pario-bib}, abstract = {In recent years, many commercial Massively Parallel Processor (MPP) systems have been available to the computing community. These systems provide very high processing power (up to hundreds of GFLOPs), and can scale efficiently with the number of processors. However, many scientific and commercial applications that run on these multiprocessors may not experience significant benefit in terms of speedup and are bottlenecked by their I/O requirements. Although these multiprocessors may be configured with sufficient I/O hardware, the file system software often fails to provide the available I/O bandwidth to the application, and causes severe performance degradation for I/O intensive applications. \par A highly efficient parallel file system has been implemented on Intel's Teraflops (TFLOPS) machine and provides a sustained I/O bandwidth of 1 GB/sec. This file system provides almost 95\% of the available raw hardware I/O bandwidth and the I/O bandwidth scales proportional to the available I/O nodes. \par Intel's TFLOPS machine is the first Accelerated Strategic Computing Initiative (ASCI) machine that DOE has acquired. This computer is 10 times more powerful than the fastest machine today, and will be used primarily to simulate nuclear testing and to ensure the safety and effectiveness of the nation's nuclear weapons stockpile. \par This machine contains over 9000 Intel Pentium Pro processors, and will provide a peak CPU performance of 1.8 teraflops. This paper presents the I/O design and architecture of Intel's TFLOPS supercomputer, describes the Cougar OS I/O and its interface with Intel's Parallel File System.}, comment = {Describes the parallel file system for ASCI Red.
The paper is only available as HTML} } @Article{gava:parallel-ml, author = {Fr\'{e}d\'{e}ric Gava}, title = {Parallel {I/O} in bulk-synchronous parallel {ML}}, journal = {Lecture Notes in Computer Science}, booktitle = {4th International Conference on Computational Science (ICCS 2004); June 6-9, 2004; Krakow, POLAND}, editor = {Bubak, M; VanAlbada, GD; Sloot, PMA; Dongarra, JJ}, year = {2004}, month = {June}, volume = {3038}, pages = {331--338}, institution = {Univ Paris 12, LACL, Creteil, France}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=3038&spage=50}, keywords = {parallel I/O, parallel ML, BSML, data parallel language, pario-bib}, abstract = {Bulk Synchronous Parallel ML or BSML is a functional data-parallel language for programming bulk synchronous parallel (BSP) algorithms. The execution time can be estimated and dead-locks and indeterminism are avoided. For large scale applications where parallel processing is helpful and where the total amount of data often exceeds the total main memory available, parallel disk I/O becomes a necessity. We present here a library of I/O features for BSML and its cost model.} } @InCollection{gennart:bcomparing, author = {Benoit A. Gennart and Roger D. Hersch}, title = {Comparing Multimedia Storage Architectures}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {37}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {548--554}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {gennart:comparing}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, multimedia, pario-bib}, comment = {Part of jin:io-book; reformatted version of gennart:comparing.} } @InCollection{gennart:comparing, author = {Benoit A. Gennart and Roger D. Hersch}, title = {Comparing Multimedia Storage Architectures}, booktitle = {Proceedings of the International Conference on Multimedia Computing and Systems}, year = {1995}, pages = {323--328}, later = {gennart:bcomparing}, URL = {http://ieeexplore.ieee.org:80/xpl/tocresult.jsp?isNumber=10325&page=2}, keywords = {parallel I/O, multimedia, pario-bib}, abstract = {Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfil these requirements, we use a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through simulation the real-time behavior of two multiprocessor multi-disk architectures: GigaView and the Unix workstation cluster. GigaView incorporates point-to-point communication between processing units and the workstation cluster supports communication through a shared bus-and-memory architecture. For a standard multimedia server architecture consisting of 8 disks and 4 disk-node processors, we evaluate stream frame access times under various parameters such as load factors, frame size, stream throughput and synchronicity requirements. 
We compare the behavior of GigaView and the workstation cluster in terms of delay and delay jitter.} } @Misc{gerner:sp2-io, author = {Jerry Gerner}, title = {Input/Output on the {IBM SP2}--- An Overview}, year = {1995}, organization = {Cornell Theory Center, Cornell University}, note = {Available at \verb+http://www.tc.cornell.edu/SmartNodes/Newsletters/IO.series/intro.html+}, URL = {http://www.tc.cornell.edu/SmartNodes/Newsletters/IO.series/intro.html}, keywords = {parallel I/O, IBM SP2, pario-bib} } @InCollection{ghandeharizadeh:bmitra, author = {Shahram Ghandeharizadeh and Roger Zimmermann and Weifeng Shi and Reza Rejaie and Douglas J. Ierardi and Ta-Wei Li}, title = {{Mitra}: A Scalable Continuous Media Server}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {41}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {595--613}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {ghandeharizadeh:mitra}, URL = {http://www.buyya.com/superstorage/}, keywords = {multimedia, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of ghandeharizadeh:mitra.} } @Article{ghandeharizadeh:mitra, author = {Shahram Ghandeharizadeh and Roger Zimmermann and Weifeng Shi and Reza Rejaie and Doug Ierardi and Ta-Wei Li}, title = {Mitra--- A Continuous Media Server}, journal = {Multimedia Tools and Applications}, year = {1998}, month = {July}, volume = {5}, number = {1}, pages = {79--108}, publisher = {Kluwer Academic Publishers}, later = {ghandeharizadeh:bmitra}, URL = {http://perspolis.usc.edu/Users/zimmerma/mitra.html}, keywords = {multimedia, parallel I/O, pario-bib}, abstract = {Mitra is a scalable storage manager that supports the display of continuous media data types, e.g., audio and video clips. It is a software based system that employs off-the-shelf hardware components. Its present hardware platform is a cluster of multi-disk workstations, connected using an ATM switch. Mitra supports the display of a mix of media types. To reduce the cost of storage, it supports a hierarchical organization of storage devices and stages the frequently accessed objects on the magnetic disks. For the number of displays to scale as a function of additional disks, Mitra employs staggered striping. It implements three strategies to maximize the number of simultaneous displays supported by each disk. First, the EVEREST file system allows different files (corresponding to objects of different media types) to be retrieved at different block size granularities. Second, the FIXB algorithm recognizes the different zones of a disk and guarantees a continuous display while harnessing the average disk transfer rate. Third, Mitra implements the Grouped Sweeping Scheme (GSS) to minimize the impact of disk seeks on the available disk bandwidth. \par In addition to reporting on implementation details of Mitra, we present performance results that demonstrate the scalability characteristics of the system. We compare the obtained results with theoretical expectations based on the bandwidth of participating disks. Mitra attains between 65% and 100% of the theoretical expectations.}, comment = {This paper describes the continuous media server Mitra. Mitra runs on a cluster of multi-disk HP 9000/735 workstations. Each workstation consists of 80 Mbytes of memory and four disks.
They implement ``staggered striping'' of the data in which disks are clustered based on media type and treated as a single logical unit. Data is then striped across the logical disk cluster in a round-robin fashion. They present performance results as a function of total number of disks and the number of disks in a cluster.} } @Article{ghandeharizadeh:servers, author = {Shahram Ghandeharizadeh and Richard Muntz}, title = {Design and implementation of scalable continuous media servers}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {91--122}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00118-X}, keywords = {parallel I/O, multimedia, pario-bib}, comment = {Part of a special issue.} } @InProceedings{ghemawat:googlefs, author = {Sanjay Ghemawat and Howard Gobioff and Shun-Tak Leung}, title = {The {Google} File System}, booktitle = {Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles}, year = {2003}, month = {October}, pages = {96--108}, publisher = {ACM Press}, address = {Bolton Landing, NY}, URL = {http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf}, keywords = {distributed file system, pario-bib}, abstract = {We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to re-examine traditional choices and explore radically different design points. \par The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. \par In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.} } @Article{ghosh:hyper, author = {Joydeep Ghosh and Kelvin D. Goveas and Jeffrey T. Draper}, title = {Performance Evaluation of a Parallel {I/O} Subsystem for Hypercube Multiprocessors}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January and February}, volume = {17}, number = {1--2}, pages = {90--106}, publisher = {Academic Press}, keywords = {parallel I/O, MIMD, multiprocessor architecture, hypercube, pario-bib}, comment = {Given a hypercube that has I/O nodes scattered throughout, they compare a plain one to one that has the I/O nodes also interconnected with a half-size hypercube. They show that this has better performance because the I/O traffic does not interfere with normal inter-PE traffic.
See also ghosh:pario.} } @InProceedings{ghosh:pario, author = {Joydeep Ghosh and Bipul Agarwal}, title = {Parallel {I/O} Subsystems for Distributed-Memory Multiprocessors}, booktitle = {Proceedings of the Fifth International Parallel Processing Symposium}, year = {1991}, pages = {381--384}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, comment = {They simulate a 128-node hypercube with 16 I/O nodes attached at uniformly distributed points. They compare two architectures: one with a separate I/O network, and another without a separate I/O network. When present, the extra network is used to route I/O packets from the originating I/O node to the I/O node closest to the destination processing node (or vice versa). They run simulations under workloads with differing amounts of locality, and experiment with different bandwidths for the links. They conclude that the extra network helps. But they never make the (proper, fair) comparison where the total network bandwidth is held constant. See also ghosh:hyper.} } @Article{gibson:arrays, author = {Garth A. Gibson}, title = {Designing Disk Arrays for High Data Reliability}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January/February}, volume = {17}, number = {1--2}, pages = {4--27}, publisher = {Academic Press}, keywords = {parallel I/O, RAID, redundancy, reliability, pario-bib} } @Book{gibson:book, author = {Garth A. Gibson}, title = {Redundant Disk Arrays: Reliable, Parallel Secondary Storage}, year = {1992}, series = {An ACM Distinguished Dissertation 1991}, publisher = {MIT Press}, keywords = {parallel I/O, disk array, disk striping, reliability, RAID, pario-bib}, comment = {Excellent book. Good source for discussion of the access gap and transfer gap, disk lifetimes, parity methods, reliability analysis, and generally the case for RAIDs. On page 220 he briefly discusses multiprocessor I/O architecture.} } @InCollection{gibson:bstorage, author = {Garth A. Gibson and David F. Nagle and Khalil Amiri and Jeff Butler and Fay W. Chang and Howard Gobioff and Charles Hardin and Erik Riedel and David Rochberg and Jim Zelenka}, title = {A Cost-Effective, High-Bandwidth Storage Architecture}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {28}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {431--444}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {gibson:storage}, URL = {http://www.buyya.com/superstorage/}, keywords = {network-attached storage, storage architecture, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of gibson:storage.} } @InProceedings{gibson:dram, author = {Garth A. Gibson and R. Hugo Patterson and M. Satyanarayanan}, title = {Disk Reads with {DRAM} Latency}, booktitle = {Third Workshop on Workstation Operating Systems}, year = {1992}, pages = {126--131}, later = {patterson:informed}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/TIP/WWOSIII.ps}, keywords = {file system, prefetching, pario-bib}, abstract = {The most difficult and frequently most important challenge for high performance file access is the achievement of low latency cache misses. We propose to explore the utility and feasibility of using file access hints to schedule overlapped prefetching of file data.
Hints may be issued explicitly by programmers, automatically by compilers, speculatively by parent tasks such as shells and makes, or historically by previously profiled executions. Our research will also address the thorny issues of hint specification, memory resource management, imprecise and incorrect hints, and appropriate interfaces for propagating hints through and to affected application, operating system, file system, and device specific modules. We begin our research with a detailed examination of two applications with large potential for improvement: compilation of multiple module software systems and scientific simulation using very large grid state files.}, comment = {A relatively early TIP report with nothing really new over patterson:tip.} } @InProceedings{gibson:failcorrect, author = {Garth A. Gibson and Lisa Hellerstein and Richard M. Karp and Randy H. Katz and David A. Patterson}, title = {Failure Correction Techniques for Large Disk Arrays}, booktitle = {Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems}, year = {1989}, month = {April}, pages = {123--132}, earlier = {gibson:raid}, later = {gibson:bfailcorrect}, URL = {http://portal.acm.org/citation.cfm?id=70082.68194}, keywords = {parallel I/O, disk array, RAID, reliability, pario-bib}, abstract = {The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information in disk arrays. Codes are selected to maximize mean time to data loss or minimize disks containing redundant data, but are all constrained to minimize performance penalties associated with updating information or recovering from catastrophic disk failures. We also present codes that give highly reliable data storage with low redundant data overhead for arrays of 1000 information disks.}, comment = {See gibson:raid for comments since it is the same.} } @Article{gibson:nasd-scaling, author = {Garth A. Gibson and David F. Nagle and Khalil Amiri and Fay W. Chang and Eugene M. Feinberg and Howard Gobioff and Chen Lee and Berend Ozceri and Erik Riedel and David Rochberg and Jim Zelenka}, title = {File server scaling with network-attached secure disks}, journal = {Performance Evaluation Review}, booktitle = {1997 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 97), 15-18 June 1997, Seattle, WA, USA}, year = {1997}, volume = {25}, number = {1}, pages = {272--284}, publisher = {ACM Press}, copyright = {(c)2004 IEE}, URL = {http://www.pdl.cmu.edu/PDL-FTP/NASD/Sigmetrics97.pdf}, keywords = {NASD, network-attached disks, distributed file system, parallel file system, security, secure disks, pario-bib}, abstract = {By providing direct data transfer between storage and client, network-attached storage devices have the potential to improve scalability for existing distributed file systems (by removing the server as a bottleneck) and bandwidth for new parallel and distributed file systems (through network striping and more efficient data paths). Together, these advantages influence a large enough fraction of the storage market to make commodity network-attached storage feasible. Realizing the technology's full potential requires careful consideration across a wide range of file system, networking and security issues.
This paper contrasts two network-attached storage architectures: (1) Networked SCSI disks (NetSCSI) are network attached storage devices with minimal changes from the familiar SCSI interface, while (2) Network-Attached Secure Disks (NASD) are drives that support independent client access to drive object services. To estimate the potential performance benefits of these architectures, we develop an analytic model and perform trace-driven replay experiments based on AFS and NFS traces. Our results suggest that NetSCSI can reduce file server load during a burst of NFS or AFS activity by about 30%. With the NASD architecture, server load (during burst activity) can be reduced by a factor of up to five for AFS and up to ten for NFS.}, comment = {Essentially, the conference (and subsequent journal) version of gibson:nasd-tr. The studies that use simple analytical models (based on measured workloads of NFS and AFS file managers) to compare performance of NASD to SAD (storage-attached disks) and NetSCSI are often cited as justification for the NASD and object-based storage approaches.} } @TechReport{gibson:nasd-tr, author = {Garth A. Gibson and David F. Nagle and Khalil Amiri and Fay W. Chang and Eugene Feinberg and Howard Gobioff and Chen Lee and Berend Ozceri and Erik Riedel and David Rochberg}, title = {A Case for Network-Attached Secure Disks}, year = {1996}, month = {June}, number = {CMU-CS-96-142}, institution = {Carnegie-Mellon University}, URL = {http://www.pdl.cmu.edu/PDL-FTP/NASD/TR96-142.pdf}, keywords = {parallel I/O, network attached storage, distributed file systems, computer security, network attached secure disks, NASD, capability system, pario-bib}, abstract = {By providing direct data transfer between storage and client, network-attached storage devices have the potential to improve scalability (by removing the server as a bottleneck) and performance (through network striping and shorter data paths). Realizing the technology's full potential requires careful consideration across a wide range of file system, networking and security issues. To address these issues, this paper presents two new network-attached storage architectures. (1) Networked SCSI disks (NetSCSI) are network-attached storage devices with minimal changes from the familiar SCSI interface; (2) Network-attached secure disks (NASD) are drives that support independent client access to drive provided object services. For both architectures, we present a sketch of repartitionings of distributed file system functionality, including a security framework whose strongest levels use tamper resistant processing in the disks to provide action authorization and data privacy even when the drive is in a physically insecure location. \par Using AFS and NFS, trace results suggest that NetSCSI can reduce file server load during a burst of AFS activity by a factor of about 2; for the NASD architecture, server load (during burst activity) can be reduced by a factor of about 4 for AFS and 10 for NFS.}, comment = {They outline their rationale for the idea of Network-attached Secure Disks (NASD). Basically the idea is to develop disk drives that attach right to the LAN, rather than to a file server, and allow clients to access the disks directly for many of the simpler file system actions (read and write file data, read file attributes), and only contact the server for more complex activities (opening and creating files, changing attributes).
This removes the load from file servers, which are getting too slow to move large amounts of data needed by large installations. Issues include security, of course, which they solve with encryption (for privacy) and time-limited capabilities (keys) given out by the server to authenticated clients, which the clients show to the disk to gain access. They compare the performance of NASD, using a simple analytical model and parameters obtained from measuring real NFS and AFS implementations, to the performance of SAD (server-attached disks) and NetSCSI (a hybrid approach that involves the server in every operation but allows data to flow directly from disk to and from the network).} } @TechReport{gibson:raid, author = {Garth Gibson and Lisa Hellerstein and Richard Karp and Randy Katz and David Patterson}, title = {Coding techniques for handling failures in large disk arrays}, year = {1988}, month = {December}, number = {UCB/CSD 88/477}, institution = {UC Berkeley}, later = {gibson:failcorrect}, URL = {http://cs-tr.cs.berkeley.edu/TR/UCB:CSD-88-477}, keywords = {parallel I/O, RAID, reliability, disk array, pario-bib}, comment = {Design of parity encodings to handle more than one bit failure in any group. Their 2-bit correcting codes are good enough for 1000-disk RAIDs, so 3-bit correction is not needed.} } @TechReport{gibson:raidframe-tr, author = {Garth A. Gibson and William V. {Courtright II} and Mark Holland and Jim Zelenka}, title = {{RAIDframe}: Rapid prototyping for disk arrays}, year = {1995}, month = {October}, number = {CMU-CS-95-200}, institution = {Carnegie Mellon University}, later = {courtright:raidframe}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/RAID/CMU-CS-95-200.ps}, keywords = {parallel I/O, RAID, disk array, reliability, simulation, pario-bib}, comment = {Short version appeared as courtright:raidframe. Pretty neat idea. They provide a way to express the sequence of disk-access operations in a RAID controller using directed acyclic graphs, and a library that can `execute' these graphs either in a simulation or in a software-RAID implementation. The big benefit is that it is faster, easier, and less error-prone to implement various RAID management policies.} } @TechReport{gibson:scotch-tr, author = {Garth A. Gibson and Daniel Stodolsky and Fay W. Chang and William V. {Courtright II} and Chris G. Demetriou and Eka Ginting and Mark Holland and Qingming Ma and LeAnn Neal and R. Hugo Patterson and Jiawen Su and Rachad Youssef and Jim Zelenka}, title = {The {Scotch} Parallel Storage Systems}, year = {1995}, month = {January}, number = {CMU-CS-95-107}, institution = {Carnegie Mellon University}, later = {gibson:scotch1}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/SPFS/tr95-107.ps}, keywords = {parallel I/O, RAID, disk array, multiprocessor file system, file prefetching, file caching, cache consistency, pario-bib} } @InProceedings{gibson:scotch1, author = {Garth A. Gibson and Daniel Stodolsky and Fay W. Chang and William V. {Courtright II} and Chris G. Demetriou and Eka Ginting and Mark Holland and Qingming Ma and LeAnn Neal and R.
Hugo Patterson and Jiawen Su and Rachad Youssef and Jim Zelenka}, title = {The {Scotch} Parallel Storage Systems}, booktitle = {Proceedings of 40th IEEE Computer Society International Conference (COMPCON 95)}, year = {1995}, month = {Spring}, pages = {403--410}, address = {San Francisco}, earlier = {gibson:scotch-tr}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/SPFS/Compcon95.ps}, keywords = {parallel I/O, RAID, disk array, multiprocessor file system, file prefetching, file caching, cache consistency, pario-bib}, comment = {An overview of research being done in Garth's group. Touches on work in RAID disk arrays, parallel file systems, and prefetching. I think gibson:scotch-tr is nearly the same.} } @Article{gibson:sdcr, author = {Garth A. Gibson and Jeffrey Scott Vitter and John Wilkes}, title = {Strategic directions in storage {I/O} issues in large-scale computing}, journal = {ACM Computing Surveys}, year = {1996}, month = {December}, volume = {28}, number = {4}, pages = {779--793}, URL = {ftp://ftp.cs.duke.edu/pub/jsv/Papers/GVW96.storage_IO.ps.gz}, keywords = {supercomputing, data storage, database, parallel I/O, pario-bib}, abstract = {We discuss the strategic directions and challenges in the management and use of storage systems--those components of computer systems responsible for the storage and retrieval of data. The performance gap between main and secondary memories shows no imminent sign of vanishing, and thus continuing research into storage I/O will be essential to reap the full benefit from the advances occurring in many other areas of computer science. In this report we identify a few strategic research goals and possible thrusts to meet those goals.}, comment = {A more reliable, but limited-access, URL is http://www.acm.org/pubs/citations/journals/surveys/1996-28-4/p779-gibson/} } @InProceedings{gibson:storage, author = {Garth Gibson and David Nagle and Khalil Amiri and Jeff Butler and Fay Chang and Howard Gobioff and Charles Hardin and Erik Riedel and David Rochberg and Jim Zelenka}, title = {A Cost-Effective High-Bandwidth Storage Architecture}, booktitle = {Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems}, year = {1998}, pages = {92--104}, publisher = {ACM Press}, later = {gibson:bstorage}, URL = {http://www.acm.org/pubs/citations/proceedings/asplos/291069/p92-gibson/}, keywords = {network-attached storage, storage architecture, parallel I/O, pario-bib} } @InProceedings{golding:attribute, author = {Richard Golding and Elizabeth Shriver and Tim Sullivan and John Wilkes}, title = {Attribute-Managed Storage}, booktitle = {Workshop on Modeling and Specification of I/O}, year = {1995}, organization = {At SPDP'95}, URL = {http://www.hpl.hp.com/personal/John_Wilkes/papers/MSIO.ps.Z}, keywords = {I/O architecture, disk array, RAID, file system, storage system, pario-bib}, abstract = {Storage systems are continuing to grow, and they are becoming shared resources with the advent of I/O networks like FibreChannel. Managing these resources to meet performance and resiliency goals is becoming a significant challenge. We believe that completely automatic, attribute-managed storage is the way to address this issue. Our approach is based on declarative specifications of both application workloads and device characteristics.
These are combined by a matching engine to generate a load-assignment that provides optimal performance and meets availability guarantees, at minimum cost.}, comment = {This is just a 4-page position paper. See also shriver:slides.} } @InProceedings{golubchik:reducing, author = {Leana Golubchik and John C. S. Lui and Richard Muntz}, title = {Reducing {I/O} Demand in Video-on-Demand Storage Servers}, booktitle = {Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1995}, month = {May}, pages = {25--36}, keywords = {video server, multimedia, parallel I/O, pario-bib}, comment = {An approach called adaptive piggybacking groups together streams that are watching the same video, but at slightly different times, so that they can share the I/O streams.} } @InProceedings{golubchik:striping, author = {Leana Golubchik and Richard R. Muntz and Richard W. Watson}, title = {Analysis of Striping Techniques in Robotic Storage Libraries}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {225--238}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/conferen/mss95/golub/golub.htm}, keywords = {mass storage, parallel I/O, pario-bib}, abstract = {In recent years advances in computational speed have been the main focus of research and development in high performance computing. In comparison, the improvement in I/O performance has been modest. Faster processing speeds have created a need for faster I/O as well as for the storage and retrieval of vast amounts of data. The technology needed to develop these mass storage systems exists today. Robotic storage libraries are vital components of such systems. However, they normally exhibit high latency and long transmission times. We analyze the performance of robotic storage libraries and study striping as a technique for improving response time. Although striping has been extensively studied in the context of disk arrays, the architectural differences between robotic storage libraries and arrays of disks suggest that a separate study of striping techniques in such libraries would be beneficial.} } @Article{golubchik:survey, author = {Leana Golubchik and John C.S. Lui and Maria Papadopouli}, title = {A survey of approaches to fault tolerant design of {VOD} servers: Techniques, analysis and comparison}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {123--155}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00119-1}, keywords = {parallel I/O, multimedia, survey, pario-bib}, comment = {Part of a special issue.} } @InProceedings{goodrich:external, author = {Michael T. Goodrich and Jyh-Jong Tsay and Darren E. Vengroff and Jeffrey Scott Vitter}, title = {External-Memory Computational Geometry}, booktitle = {Proceedings of the 34th Annual Symposium on Foundations of Computer Science}, year = {1993}, month = {November}, pages = {714--723}, keywords = {computational geometry, parallel I/O algorithm, pario-bib}, abstract = {In this paper, we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory, and we use these techniques to develop optimal and practical algorithms for a number of important large-scale problems in computational geometry. Our algorithms are optimal for a wide range of two-level and hierarchical multilevel memory models, including parallel models.
The algorithms are optimal in terms of both I/O cost and internal computation. \par Our results are built on four fundamental techniques: {\it distribution sweeping}, a generic method for externalizing plane-sweep algorithms; {\it persistent B-trees}, for which we have both on-line and off-line methods; {\it batch filtering}, a general method for performing $K$ simultaneous external-memory searches in any data structure that can be modeled as a planar layered dag; and {\it external marriage-before-conquest}, an external-memory analog of the well-known technique of Kirkpatrick and Seidel. Using these techniques we are able to solve a very large number of problems in computational geometry, including batched range queries, 2-d and 3-d convex hull construction, planar point location, range queries, finding all nearest neighbors for a set of planar points, rectangle intersection/union reporting, computing the visibility of segments from a point, performing ray-shooting queries in constructive solid geometry (CSG) models, as well as several geometric dominance problems. \par These results are significant because large-scale problems involving geometric data are ubiquitous in spatial databases, geographic information systems (GIS), constraint logic programming, object oriented databases, statistics, virtual reality systems, and graphics. This work makes a big step, both theoretically and in practice, towards the effective management and manipulation of geometric data in external memory, which is an essential component of these applications.} } @InProceedings{gopinath:3tier, author = {K. Gopinath and Nitin Muppalaneni and N. Suresh Kumar and Pankaj Risbood}, title = {A 3-Tier {RAID} Storage System with {RAID1}, {RAID5}, and Compressed {RAID5} for {Linux}}, booktitle = {Proceedings of the FREENIX Track at the 2000 USENIX Annual Technical Conference}, year = {2000}, pages = {21--34}, publisher = {USENIX Association}, keywords = {parallel I/O, RAID, disk array, pario-bib} } @InProceedings{gotwals:pario, author = {Jacob Gotwals and Suresh Srinivas and Shelby Yang}, title = {Parallel {I/O} from the User's Perspective}, booktitle = {Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation}, year = {1995}, pages = {129--137}, keywords = {parallel I/O, pario-bib} } @InProceedings{gotwals:streams, author = {Jacob Gotwals and Suresh Srinivas and Dennis Gannon}, title = {{pC++}/streams: a Library for {I/O} on Complex Distributed Data Structures}, booktitle = {Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, year = {1995}, month = {July}, pages = {11--19}, publisher = {ACM Press}, address = {Santa Barbara, CA}, URL = {ftp://ftp.cs.indiana.edu/pub/techreports/TR422a.ps.Z}, keywords = {parallel I/O, object-oriented, distributed data structures, runtime library, pario-bib}, comment = {URL is for tech report version. They have a language called pC++ that allows object-parallel programming. They have a library called d/streams for I/O of distributed arrays. pC++/streams is a combination. You open a file, specify the in-memory distribution, read from the stream, and then extract some variables. Likewise, you insert some variables (into the stream buffer), then write it. They manage the distribution, and they store necessary metadata to reassemble the data structure when reading. Variables can be arbitrary classes, with $>>$ and $<<$ overloaded as the insert and extract operators. 
Performance is reasonable on Intel Paragon and SGI Challenge.} } @Misc{gray:infinite, author = {Jim Gray}, title = {What Happens When Processing, Storage, and Bandwidth are Free and Infinite?}, year = {1997}, month = {November}, howpublished = {Keynote address at IOPADS~'97}, URL = {http://www.research.microsoft.com/barc/gray/talks/IOPADS.ppt}, keywords = {parallel computing, computer architecture, parallel I/O, pario-bib, memory hierarchy, distributed computing, database, object oriented}, abstract = {Technology trends promise to give us processors with pico-second clock speeds. These pico-processors will spend much of their time waiting for information from the storage hierarchy. I believe this will force us to adopt a data-flow programming model. Similar trends will bless us with peta-byte online stores with exa-byte near-line stores. One large disk manufacturer claims it costs \$8/year to manage a megabyte of online storage. That is \$8 billion per year to manage a petabyte. Automating storage management is one of our major challenges. This talk covers these technology trends, surveys the current status of commercial software tools (aka database systems), their peak performance and price performance. It then poses four major challenges: total-cost-of-ownership, long-term archiving, reliably storing exabytes, and data mining on petabyte databases.}, comment = {Very interesting talk. URL points to PowerPoint slides.} } @InProceedings{gray:stripe, author = {Jim Gray and Bob Horst and Mark Walker}, title = {Parity Striping of Disk Arrays: Low-cost Reliable Storage with Acceptable Throughput}, booktitle = {Proceedings of the 16th VLDB Conference}, year = {1990}, pages = {148--159}, keywords = {disk striping, reliability, pario-bib}, comment = {Parity striping, a variation of RAID 5, is just a different way of mapping blocks to disks. It groups parity blocks into extents, and does not stripe the data blocks. A logical disk is mostly contained in one physical disk, plus a parity region in another disk. Good for transaction processing workloads. Has the low cost/GByte of RAID, the reliability of RAID, without the high transfer rate of RAID, but with much better requests/second throughput than RAID 5. (But 40\% worse than mirrors.) So it is a compromise between RAID and mirrors. BUT, see mourad:raid.} } @TechReport{grimshaw:ELFSTR, author = {Andrew S. Grimshaw and Loyot, Jr., Edmond C.}, title = {{ELFS:} Object-oriented Extensible File Systems}, year = {1991}, month = {July}, number = {TR-91-14}, institution = {Univ. of Virginia Computer Science Department}, later = {grimshaw:elfs}, URL = {ftp://ftp.cs.virginia.edu/pub/techreports/CS-91-14.ps.Z}, keywords = {parallel I/O, parallel file system, object-oriented, file system interface, Intel iPSC/2, pario-bib}, comment = {See also grimshaw:elfs. They hope to provide high bandwidth and low latency, reduce the cognitive burden on the programmer, and manage proliferation of data formats and architectural changes. Details of the plan to make an extensible OO interface to file system. Objects each have a separate thread of control, so they can do asynchronous activity like prefetching and caching in the background, and support multiple outstanding requests. The Mentat object system makes it easy for them to support pipelining of I/O with I/O and computation in the user program. Let the user choose type of consistency needed. See grimshaw:objects for more results.} } @InProceedings{grimshaw:elfs, author = {Andrew S.
Grimshaw and Loyot, Jr., Edmond C.}, title = {{ELFS:} Object-oriented Extensible File Systems}, booktitle = {Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year = {1991}, month = {December}, pages = {177}, earlier = {grimshaw:ELFSTR}, later = {grimshaw:objects}, keywords = {parallel I/O, parallel file system, object-oriented, file system interface, pario-bib}, comment = {Full paper grimshaw:ELFSTR. Really neat idea. Uses OO interface to file system, which is mostly in user mode. The object classes represent particular access patterns (e.g., 2-D matrix) in the file, and hide the actual structure of the file. The object knows enough to tailor the cache and prefetch algorithms to the semantics. Class inheritance allows layering.} } @InProceedings{grimshaw:objects, author = {Andrew S. Grimshaw and Jeff Prem}, title = {High Performance Parallel File Objects}, booktitle = {Proceedings of the Sixth Annual Distributed-Memory Computer Conference}, year = {1991}, pages = {720--723}, keywords = {parallel I/O, multiprocessor file system, file system interface, pario-bib}, comment = {Not much new from ELFS TR. A better citation than grimshaw:ELFS though. Does give CFS performance results. Note on page 721 he says that CFS prefetches into ``local memory from which to satisfy future user requests {\em that never come.}'' This happens if the local access pattern isn't purely sequential, as in an interleaved pattern.} } @Article{gropp:io-redundancy, author = {William D. Gropp and Robert Ross and Neill Miller}, title = {Providing efficient {I/O} redundancy in {MPI} environments}, journal = {Lecture Notes in Computer Science}, booktitle = {11th European Parallel Virtual Machine and Message Passing Interface Users Group Meeting; September 19-22, 2004; Budapest, HUNGARY}, editor = {Kranzlmuller, D; Kacsuk, P; Dongarra, J}, year = {2004}, month = {November}, volume = {3241}, pages = {77--86}, institution = {Argonne Natl Lab, Div Math \& Comp Sci, 9700 S Cass Ave, Argonne, IL 60439 USA; Argonne Natl Lab, Div Math \& Comp Sci, Argonne, IL 60439 USA}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://www.springerlink.com/link.asp?id=wxx7xg3hb3xftx8b}, keywords = {fault-tolerance, single-disk failures, MPI-IO, pario-bib}, abstract = {Highly parallel applications often use either highly parallel file systems or large numbers of independent disks. Either approach can provide the high data rates necessary for parallel applications. However, the failure of a single disk or server can render the data useless. Conventional techniques, such as those based on applying erasure correcting codes to each file write, are prohibitively expensive for massively parallel scientific applications because of the granularity of access at which the codes are applied. In this paper we demonstrate a scalable method for recovering from single disk failures that is optimized for typical scientific data sets. This approach exploits coarser-grained (but precise) semantics to reduce the overhead of constructing recovery data and makes use of parallel computation (proportional to the data size and independent of number of processors) to construct data.
Experiments are presented showing the efficiency of this approach on a cluster with independent disks, and a technique is described for hiding the creation of redundant data within the MPI-IO implementation.} } @Book{gropp:mpi2, author = {William Gropp and Ewing Lusk and Rajeev Thakur}, title = {Using {MPI-2}: Advanced Features of the Message-Passing Interface}, year = {1999}, publisher = {MIT Press}, address = {Cambridge, MA}, URL = {http://mitpress.mit.edu/book-home.tcl?isbn=0262571331}, keywords = {parallel computing, message passing, parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {The Message Passing Interface (MPI) specification is widely used for solving significant scientific and engineering problems on parallel computers. There exist more than a dozen implementations on computer platforms ranging from IBM SP-2 supercomputers to clusters of PCs running Windows NT or Linux ("Beowulf" machines). The initial MPI Standard document, MPI-1, was recently updated by the MPI Forum. The new version, MPI-2, contains both significant enhancements to the existing MPI core and new features.\par Using MPI is a completely up-to-date version of the authors' 1994 introduction to the core functions of MPI. It adds material on the new C++ and Fortran 90 bindings for MPI throughout the book. It contains greater discussion of datatype extents, the most frequently misunderstood feature of MPI-1, as well as material on the new extensions to basic MPI functionality added by the MPI-2 Forum in the area of MPI datatypes and collective operations.\par Using MPI-2 covers the new extensions to basic MPI. These include parallel I/O, remote memory access operations, and dynamic process management. The volume also includes material on tuning MPI applications for high performance on modern MPI implementations.}, comment = {Has a large chapter on MPI-IO with lots of example programs.} } @InProceedings{gross:io, author = {Thomas Gross and Peter Steenkiste}, title = {Architecture Implications of High-speed {I/O} for Distributed-Memory Computers}, booktitle = {Proceedings of the 8th ACM International Conference on Supercomputing}, year = {1994}, month = {July}, pages = {176--185}, publisher = {ACM Press}, address = {Manchester, UK}, keywords = {parallel I/O, parallel architecture, networking, pario-bib}, comment = {They examine the characteristics of a system that has I/O nodes which interface between the internal interconnection network of a distributed-memory MIMD machine and some external network, such as HIPPI. They build a simple model to show how different components affect the I/O throughput. They show the performance of their iWarp-HIPPI interface. They conclude that the I/O nodes must have sufficient memory bandwidth to support multiple data streams coming from several compute nodes, being combined into a single faster external network, or vice versa. They need to support scatter/gather, because the data is often distributed in small pieces. For the same reason, they need to have low per-message overhead. The internal network routing must allow multiple paths between compute nodes and the I/O nodes, to avoid congestion.} } @InCollection{grossi:crosstrees, author = {Roberto Grossi and Giuseppe F. 
Italiano}, title = {Efficient Cross-trees for External Memory}, booktitle = {External Memory Algorithms and Visualization}, editor = {James Abello and Jeffrey Scott Vitter}, crossref = {abello:dimacs}, year = {1999}, series = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science}, pages = {87--106}, publisher = {American Mathematical Society Press}, address = {Providence, RI}, keywords = {out-of-core algorithm, data structure, pario-bib}, comment = {See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.} } @InCollection{grossman:blibrary, author = {Robert Grossman and Xiao Qin and Wen Xu and Harry Hulen and Terry Tyler}, title = {An Architecture for a Scalable High-Performance Digital Library}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {39}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {566--575}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {grossman:library}, URL = {http://www.buyya.com/superstorage/}, keywords = {mass storage, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of grossman:library.} } @InProceedings{grossman:library, author = {R. Grossman and X. Qin and W. Xu and H. Hulen and T. Tyler}, title = {An Architecture for a Scalable High-Performance Digital Library}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {89--98}, publisher = {IEEE Computer Society Press}, later = {grossman:blibrary}, URL = {http://www.computer.org/conferen/mss95/grossman/grossman.htm}, keywords = {mass storage, parallel I/O, pario-bib}, abstract = {Requirements for a high-performance, scalable digital library of multimedia data are presented together with a layered architecture for a system that addresses the requirements. The approach is to view digital data as persistent collections of complex objects and to use lightweight object management to manage this data. To scale as the amount of data increases, the object management component is layered over a storage management component. The storage management component supports hierarchical storage, third-party data transfer and parallel input-output. Several issues that arise from the interface between the storage management and object management components are discussed. The authors have developed a prototype of a digital library using this design. Two key components of the prototype are AIM Net and HPSS. AIM Net is a persistent object manager and is a product of Oak Park Research. HPSS is the High Performance Storage System, developed by a collaboration including IBM Government Systems and several national labs.} } @InProceedings{gupta:generating, author = {Sandeep K. S. Gupta and Zhiyong Li and John H. 
Reif}, title = {Generating Efficient Programs for Two-level Memories from Tensor-Products}, booktitle = {Proceedings of the Seventh IASTED/ISMM International Conference on Parallel and Distributed Computing and Systems}, year = {1995}, month = {October}, pages = {510--513}, address = {Washington, D.C.}, URL = {ftp://ftp.cs.duke.edu/pub/zli/papers/pdcs95.ps.gz}, keywords = {parallel I/O algorithm, pario-bib}, abstract = {This paper presents a framework for synthesizing efficient out-of-core programs for block recursive algorithms such as the fast Fourier transform (FFT) and Batcher's bitonic sort. The block recursive algorithms considered in this paper are described using tensor (Kronecker) product and other matrix operations. The algebraic properties of the matrix representation are used to derive efficient out-of-core programs. These programs are targeted towards a two-level disk model which allows HPF supported cyclic(B) data distribution on a disk array. The effectiveness of our approach is demonstrated through an example out-of-core FFT program implemented on a workstation.} } @Article{hack:ncar, author = {James J. Hack and James M. Rosinski and David L. Williamson and Byron A. Boville and John E. Truesdale}, title = {Computational design of the {NCAR} community climate model}, journal = {Parallel Computing}, year = {1995}, volume = {21}, pages = {1545--1569}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel computing, scientific computing, weather prediction, global climate model, parallel I/O, pario-bib}, comment = {There is some discussion of I/O issues. This weather code does some out-of-core work, to communicate data between time steps. They also dump a 'history' file every simulated day, and periodic checkpoint files. They are flexible about the layout of the history file, assuming postprocessing will clean it up. The I/O is not too much trouble on the Cray C90, where they get 350 MBps to the SSD for the out-of-core data. The history I/O is no problem. On distributed-memory machines with no SSD, out-of-core was impractical and the history file was only written once per simulated month. 'The most significant weakness in the distributed-memory implementation is the treatment of I/O, [due to] file system maturity....' See hammond:atmosphere and jones:skyhi in the same issue.} } @InProceedings{hacker:effects, author = {Thomas J. Hacker and Brian Noble and Brian D. Athey}, title = {The Effects of Systemic Packet Loss on Aggregate {TCP} Flows}, booktitle = {Proceedings of SC2002: High Performance Networking and Computing}, year = {2002}, month = {November}, address = {Baltimore, MD}, URL = {http://www-personal.engin.umich.edu/~hacker/papers/SC_2002_full.pdf}, keywords = {network congestion, parallel tcp streams, transport protocols, pario-bib} } @InProceedings{hacker:fairness, author = {Thomas J. Hacker and Brian Noble and Brian D. Athey}, title = {Improving Throughput and Maintaining Fairness using Parallel {TCP}}, booktitle = {The 23rd Conference on the IEEE Communications Society (INFOCOM)}, year = {2004}, month = {March}, publisher = {IEEE Computer Society Press}, address = {Hong Kong}, URL = {http://www.ieee-infocom.org/2004/Papers/52_1.PDF}, keywords = {network congestion, parallel tcp streams, fairness, transport protocols, pario-bib}, comment = {Also see earlier hacker:parallel-tcp and hacker:effects} } @InProceedings{hacker:parallel-tcp, author = {Thomas J. Hacker and Brian D.
Athey and Brian Noble}, title = {The end-to-end performance effects of parallel {TCP} sockets on a lossy wide-area network}, booktitle = {Proceedings of the International Parallel and Distributed Processing Symposium}, year = {2002}, month = {April}, pages = {434--443}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 IEE}, address = {Fort Lauderdale, Florida}, URL = {http://www-personal.engin.umich.edu/~hacker/papers/IPDPS.PDF}, keywords = {network congestion, parallel tcp streams, transport protocols, pario-bib}, abstract = {This paper examines the effects of using parallel TCP flows to improve end-to-end network performance for distributed data intensive applications. A series of transmission experiments were conducted over a wide-area network to assess how parallel flows improve throughput, and to understand the number of flows necessary to improve throughput while avoiding congestion. An empirical throughput expression for parallel flows based on experimental data is presented, and guidelines for the use of parallel flows are discussed. (45 refs.)} } @InProceedings{hadimioglu:fs, author = {Haldun Hadimioglu and Robert J. Flynn}, title = {The Architectural Design of a Tightly-Coupled Distributed Hypercube File System}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, pages = {147--150}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {hypercube, multiprocessor file system, pario-bib}, comment = {An early paper describing a proposed file system for hypercubes. The writing is almost impenetrable. Confusing and not at all clear what they propose. See also hadimioglu:hyperfs and flynn:hyper-fs.} } @InProceedings{hadimioglu:hyperfs, author = {Haldun Hadimioglu and Robert J. Flynn}, title = {The Design and Analysis of a Tightly Coupled Hypercube File System}, booktitle = {Proceedings of the Fifth Annual Distributed-Memory Computer Conference}, year = {1990}, pages = {1405--1410}, keywords = {multiprocessor file system, parallel I/O, hypercube, pario-bib}, comment = {Describes a hypercube file system based on I/O nodes and processor nodes. A few results from a hypercube simulator. See hadimioglu:fs and flynn:hyper-fs.} } @Article{hammond:atmosphere, author = {Steven W. Hammond and Richard D. Loft and John M. Dennis and Richard K. Sato}, title = {Implementation and performance issues of a massively parallel atmospheric model}, journal = {Parallel Computing}, year = {1995}, volume = {21}, pages = {1593--1619}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel computing, scientific computing, weather prediction, global climate model, parallel I/O, pario-bib}, comment = {They discuss a weather code that runs on the CM-5. The code writes a history file, dumping some data every timestep, and periodically a restart file. They found that CM-5 Fortran met their needs, although it required huge buffers to get much scalability. They want to see a single, shared file-system image from all processors, have the file format be independent of processor count, use a portable conventional interface, and have throughput scale with the number of computation processors.
See also hack:ncar and jones:skyhi in the same issue.} } @InProceedings{harry:vipfs, author = {Michael Harry and Juan Miguel {del Rosario} and Alok Choudhary}, title = {{VIP-FS: A VIrtual, Parallel File System} for High Performance Parallel and Distributed Computing}, booktitle = {Proceedings of the Ninth International Parallel Processing Symposium}, year = {1995}, month = {April}, pages = {159--164}, note = {Also appeared in ACM Operating Systems Review 29(3), July 1995, pages 35--48}, keywords = {parallel I/O, parallel file system, heterogeneous, pario-bib}, comment = {See delrosario:vipfs-tr for an earlier version. Also appears as NPAC report SCCS-686.} } @Misc{hart:grid, author = {Leslie Hart and Tom Henderson and Bernardo Rodriguez}, title = {An {MPI} Based Scalable Runtime System: {I/O} Support for a Grid Library}, year = {1995 or earlier}, keywords = {parallel I/O, runtime library, pario-bib}, abstract = {In order to attain portability when using message passing on a distributed memory system, a portable message passing system must be used as well as other portable system support services. MPI[1] addresses the message passing problem. To date, there are no standards for system services and I/O. A library developed at NOAA's Forecast Systems Laboratory (FSL) known as the Nearest Neighbor Tool[2] (NNT) provides a high level portable interface to interprocess communications for finite difference approximation numerical weather prediction (NWP) models. In order to achieve portability, MPI is used to support interprocess communications. The other services are provided by the lower level library developed at NOAA/FSL known as the Scalable Runtime System (SRS). The principal focus of this paper is SRS.}, comment = {They describe the runtime system that supports the Nearest-Neighbor Tool (NNT), which they use to parallelize weather-prediction codes. This paper gives a vague overview of the I/O support. The interface sounds fairly typical, as does the underlying structure (server processes, cache processes, etc). Sounds like it is in its early stages, but is useful for many applications.} } @InCollection{hartman:bzebra, author = {John H. Hartman and John K. Ousterhout}, title = {The {Zebra} Striped Network File System}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {21}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {309--329}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {hartman:zebra3}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, distributed file system, disk striping, pario-bib}, comment = {Part of jin:io-book; reformatted version of hartman:zebra3.} } @InProceedings{hartman:zebra, author = {John H. Hartman and John K. Ousterhout}, title = {{Zebra: A} Striped Network File System}, booktitle = {Proceedings of the USENIX File Systems Workshop}, year = {1992}, month = {May}, pages = {71--78}, later = {hartman:zebra2}, keywords = {disk striping, distributed file system, pario-bib}, comment = {Not a parallel file system, but worth comparing to Swift. Certainly, a similar idea could be used in a multiprocessor. Cite hartman:zebra3.} } @InProceedings{hartman:zebra2, author = {John H. Hartman and John K.
Ousterhout}, title = {The {Zebra} Striped Network File System}, booktitle = {Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles}, year = {1993}, pages = {29--43}, publisher = {ACM Press}, address = {Asheville, NC}, earlier = {hartman:zebra}, later = {hartman:zebra3}, keywords = {file system, disk striping, distributed file system, RAID, log-structured file system, parallel I/O, pario-bib}, comment = {Zebra stripes across network servers, but not on a file-by-file basis. Instead they use LFS ideas to stripe a per-client log across all file servers. Each client can then compute a parity block for each stripe that it writes. They store ``deltas'', changes in block locations, in with the data, and also send them to the (central) file manager. The file manager and stripe cleaner are key state managers that keep track of where blocks are located, and of stripe utilizations. Performance numbers limited to small-scale tests. This paper has more details than hartman:zebra, and performance numbers (but not with real workloads or stripe cleaner). Some tricky consistency issues.} } @Article{hartman:zebra3, author = {John H. Hartman and John K. Ousterhout}, title = {The {Zebra} Striped Network File System}, journal = {ACM Transactions on Computer Systems}, year = {1995}, month = {August}, volume = {13}, number = {3}, pages = {274--310}, publisher = {ACM Press}, earlier = {hartman:zebra2}, later = {hartman:bzebra}, URL = {http://portal.acm.org/citation.cfm?id=210126.210131}, keywords = {parallel I/O, distributed file system, disk striping, pario-bib} } @InProceedings{hatcher:linda, author = {Philip J. Hatcher and Michael J. Quinn}, title = {{C*-Linda:} {A} Programming Environment with Multiple Data-Parallel Modules and Parallel {I/O}}, booktitle = {Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences}, year = {1991}, pages = {382--389}, keywords = {parallel I/O, Linda, data parallel, nCUBE, parallel graphics, heterogeneous computing, pario-bib}, comment = {C*-Linda is basically a combination of C* and C-Linda. The model is that of several SIMD modules interacting in a MIMD fashion through a Linda tuple space. The modules are created using {\tt eval}, as in Linda. In this case, the compiler statically assigns each eval to a separate subcube on an nCUBE 3200, although they also talk about multiprogramming several modules on a subcube (not supported by VERTEX). They envision having separate modules running on the nCUBE's graphics processors, or having the file system directly talk to the tuple space, to support I/O. They also envision talking to modules elsewhere on a network, e.g., a workstation, through the tuple space. They reject the idea of sharing memory between modules due to the lack of synchrony between modules, and message passing because it is error-prone.} } @InProceedings{hayes:nCUBE, author = {John P. Hayes and Trevor N. Mudge and Quentin F. Stout and Stephen Colley and John Palmer}, title = {Architecture of a Hypercube Supercomputer}, booktitle = {Proceedings of the 1986 International Conference on Parallel Processing}, year = {1986}, pages = {653--660}, publisher = {IEEE Computer Society Press}, address = {St. Charles, IL}, keywords = {hypercube, parallel architecture, nCUBE, pario-bib}, comment = {Description of the first nCUBE, the NCUBE/ten. Good historical background about hypercubes. Talks about their design choices.
Says a little about the file system --- basically just a way of mounting disks on top of each other, within the nCUBE and to other nCUBEs.} } @Article{hellwagner:pfs, author = {Hermann Hellwagner}, title = {Design Considerations for Scalable Parallel File Systems}, journal = {The Computer Journal}, year = {1993}, volume = {36}, number = {8}, pages = {741--755}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {An overview of the issues in designing a parallel file system, along with some early ideas for their own file system. They aim for a general-purpose system, and characterize the workload into three classes: independent, much like a timesharing system; cooperative-agents, like that expected by most current MIMD file systems; and single-agent, for data-parallel programs where a ``master'' process issues single large requests on behalf of many processes. Their design is heavily weighted to the assumption of shared memory, and in particular to a randomized shared memory (like RP3), so they don't worry about locality much. They say little about their interface, although they intend to stick to a Unix interface --- and Unix semantics --- as much as possible. The file system is essentially represented by a collection of shared data structures and many threads to manipulate those structures.} } @InProceedings{hemy:gigabit, author = {Michael Hemy and Peter Steenkiste}, title = {Gigabit {I/O} for Distributed-Memory Machines: Architecture and Applications}, booktitle = {Proceedings of Supercomputing '95}, year = {1995}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://www.supercomp.org/sc95/proceedings/570_MHEM/SC95.HTM}, keywords = {parallel network I/O, pario-bib}, abstract = {Distributed-memory systems have traditionally had great difficulty performing network I/O at rates proportional to their computational power. The problem is that the network interface has to support network I/O for a supercomputer, using computational and memory bandwidth resources similar to those of a workstation. As a result, the network interface becomes a bottleneck. We implemented an architecture for network I/O for the iWarp system with the following two key characteristics: first, application-specific tasks are off-loaded from the network interface to the distributed-memory system, and second, these tasks are performed in close cooperation with the application. The network interface has been used by several applications for over a year. In this paper we describe the network interface software that manages the communication between the iWarp distributed-memory system and the network interface, we validate the main features of our network interface architecture based on application experience, and we discuss how this architecture can be used by other distributed-memory systems.}, comment = {Parallel network I/O on the iWARP. Note proceedings only on CD-ROM and WWW.} } @InProceedings{henderson:shpio, author = {Mark Henderson and Bill Nickless and Rick Stevens}, title = {A Scalable High-Performance {I/O} System}, booktitle = {Proceedings of the Scalable High-Performance Computing Conference}, year = {1994}, pages = {79--86}, keywords = {parallel I/O, pario-bib}, comment = {Scalable I/O initiative intends to build a testbed. At Argonne, they have a 128-node SP-1 with a high-speed switch. 96 are compute nodes, 32 are I/O nodes (128 MB RAM, 1 GB local disk, FibreChannel port). 
FibreChannel connects to RS/6000 which has 256 MB RAM, two 80 MB/s busses, and a HIPPI interface to a 220 GB RAID (level 1 or 5) and 6.4 TB tape robot. They run UniTree on all this. They use multiple files to get parallelism. FibreChannel with TCP/IP is the limiting factor. Note that they are focusing more on the external connectivity issues than on the internal file system.} } @Article{herbst:bottleneck, author = {Kris Herbst}, title = {Trends in Mass Storage: vendors seek solutions to growing {I/O} bottleneck}, journal = {Supercomputing Review}, year = {1991}, month = {March}, pages = {46--49}, keywords = {parallel I/O, disk media, optical disk, holographic storage, trends, tape storage, parallel transfer disk, disk striping, pario-bib}, comment = {A good overview of the current state of the art in March 1991, including particular numbers and vendor names. They discuss disk media (density, rotation, etc.), parallel transfer disks, disk arrays, parity and RAID, HiPPI, tape archives, optical memory, and holographic storage. Rotation speeds can increase as diameter goes down. Density increases are often offset by slower head settling times. Disk arrays will hit their ``heyday'' in the 1990s. Trend toward network-attached storage devices that don't need a computer as a server.} } @MastersThesis{herland:mpvms, author = {Bjarne Geir Herland}, title = {{MPVMS} --- {MasPar} Virtual Memory System}, year = {1992}, month = {July}, school = {University of Bergen}, address = {Bergen, Norway}, keywords = {parallel I/O, virtual memory, SIMD, multiprocessor file system, pario-bib}, comment = {He has an MPL (Maspar C) preprocessor that inserts code to allow you to make plural vectors and arrays pageable. The preprocessor inserts checks before every access to see whether you have that data in memory, and if not, to page it in. The preprocessor is supported by a run-time library. No compiler, OS, or hardware mods.} } @InProceedings{hersch:pixmap, author = {Roger D. Hersch}, title = {Parallel Storage and Retrieval of Pixmap Images}, booktitle = {Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems}, year = {1993}, pages = {221--226}, keywords = {parallel I/O, file system, pario-bib}, comment = {Ways to arrange 2-d images on disk arrays that have multiple processors (like Datamesh), so that retrieval time for images or subimages is minimized.} } @Article{hey:parkbench, author = {Tony Hey and David Lancaster}, title = {The Development of {Parkbench} and Performance Prediction}, journal = {The International Journal of High Performance Computing Applications}, year = {2000}, month = {Fall}, volume = {14}, number = {3}, pages = {205--215}, URL = {http://www.ecs.soton.ac.uk/~djl/GENESIS/peis.ps.gz}, keywords = {parallel I/O benchmarks, MPI-IO, pario-app, pario-bib} } @InProceedings{hidrobo:autonomic, author = {Francisco Hidrobo and Toni Cortes}, title = {Towards an autonomic storage system to improve parallel {I/O}}, booktitle = {Proceedings of the 15th IASTED International Conference on Parallel and Distributed Computing and Systems}, year = {2003}, month = {November}, pages = {122-127, vol 1}, publisher = {ACTA Press}, copyright = {(c)2004 IEE}, address = {Marina del Rey, CA}, keywords = {performance prediction, data placement, storage device modeling, parallel I/O, pario-bib}, abstract = {In this paper, we present a mechanism able to predict the performance a given workload will achieve when running on a given storage device. This mechanism is composed of two modules.
The first one models the workload so that it can reproduce its behavior later on, without a new execution, even when the storage drives or data placement are modified. The second module is a drive modeler that is able to learn how a storage drive works in an automatic way, just executing some synthetic tests. Once we have the workload and drive models, we can predict how well that application will perform on the selected storage device or devices or when the data placement is modified. The results presented in this paper will show that this prediction system achieves errors below 10\% when compared to the real performance obtained. It is important to notice that the two modules will treat both the application and the storage device as black boxes and will need no previous information about them. (20 refs.)}, comment = {Could not find a URL. See for proceedings information.} } @Article{hillis:cm5, author = {W. Daniel Hillis and Lewis W. Tucker}, title = {The {CM-5} Connection Machine: A Scalable Supercomputer}, journal = {Communications of the ACM}, year = {1993}, month = {November}, volume = {36}, number = {11}, pages = {31--40}, keywords = {parallel architecture, SIMD, MIMD, parallel I/O, pario-bib}, comment = {A good basic citation for the CM-5 architecture. A little bit about I/O.} } @InProceedings{hirano:deadlock, author = {Satoshi Hirano and Masaru Kitsuregawa and Mikio Takagi}, title = {A High Performance Parallel {I/O} Model and its Deadlock Prevention/Avoidance Technique on the Super Database Computer {(SDC)}}, booktitle = {Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences}, year = {1993}, volume = {I}, pages = {21--30}, keywords = {parallel database, concurrency control, deadlock, parallel I/O, pario-bib}, comment = {Most interesting to me in this paper is their discussion of the ``container model,'' in which they claim they allow the processors to be driven by the I/O devices. What it boils down to is a producer-consumer queue of containers, each of which contains a task (some tuples and presumably some instruction about what to do with them). The disks put data into containers and stick them on the queue; the processors repeatedly pull containers (tasks) from the queue and process them. They don't describe the activity of the disks in much detail. See kitsuregawa:sdc.} } @InProceedings{ho:reorganization, author = {T. K. Ho and Jack Y. B. Lee}, title = {A Row-Permutated Data Reorganization Algorithm for Growing Server-less Video-on-Demand Systems}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {44--51}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190044abs.htm}, keywords = {data reorganization, video on demand, video streaming, pario-bib}, abstract = {Recently, a new server-less architecture has been proposed for building low-cost yet scalable video streaming systems. Compared to conventional client-server-based video streaming systems, this server-less architecture does not need any dedicated video server and yet is highly scalable. Video data are distributed among user hosts and these hosts cooperate to stream video data to one another. Thus as new hosts join the system, they also add streaming and storage capacity to absorb the added streaming load. This study investigates the data reorganization problem when growing a server-less video streaming system.
Specifically, as video data are distributed among user hosts, these data will need to be redistributed to newly joined hosts to utilize their storage and streaming capacity. This study presents a new data reorganization algorithm that allows controllable tradeoff between data reorganization overhead and streaming load balance.} } @InProceedings{holland:decluster, author = {Mark Holland and Garth Gibson}, title = {Parity Declustering for Continuous Operation in Redundant Disk Arrays}, booktitle = {Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems}, year = {1992}, pages = {23--35}, later = {holland:bdecluster}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/Declustering/ASPLOS.ps}, keywords = {parity, declustering, striping, disk array, redundancy, reliability, pario-bib}, abstract = {We describe and evaluate a strategy for declustering the parity encoding in a redundant disk array. This declustered parity organization balances cost against data reliability and performance during failure recovery. It is targeted at highly-available parity-based arrays for use in continuous- operation systems. It improves on standard parity organizations by reducing the additional load on surviving disks during the reconstruction of a failed disk's contents. This yields higher user throughput during recovery, and/or shorter recovery time. \par We first address the generalized parity layout problem, basing our solution on balanced incomplete and complete block designs. A software implementation of declustering is then evaluated using a disk array simulator under a highly concurrent workload comprised of small user accesses. We show that declustered parity penalizes user response time while a disk is being repaired (before and during its recovery) less than comparable non-declustered (RAID5) organizations without any penalty to user response time in the fault-free state. \par We then show that previously proposed modifications to a simple, single-sweep reconstruction algorithm further decrease user response times during recovery, but, contrary to previous suggestions, the inclusion of these modifications may, for many configurations, also slow the reconstruction process. This result arises from the simple model of disk access performance used in previous work, which did not consider throughput variations due to positioning delays.} } @Article{holland:on-line, author = {Mark Holland and Garth A. Gibson and Daniel P. Siewiorek}, title = {Architectures and Algorithms for On-Line Failure Recovery in Redundant Disk Arrays}, journal = {Journal of Distributed and Parallel Databases}, year = {1994}, month = {July}, volume = {2}, number = {3}, pages = {295--335}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/Declustering/DAPD.ps}, keywords = {parallel I/O, disk array, RAID, redundancy, reliability, pario-bib}, abstract = {The performance of traditional RAID Level 5 arrays is, for many applications, unacceptably poor while one of its constituent disks is non-functional. This paper describes and evaluates mechanisms by which this disk array failure-recovery performance can be improved. The two key issues addressed are the data layout, the mapping by which data and parity blocks are assigned to physical disk blocks in an array, and the reconstruction algorithm, which is the technique used to recover data that is lost when a component disk fails. 
\par The data layout techniques this paper investigates are instantiations of the declustered parity organization, a derivative of RAID Level 5 that allows a system to trade some of its data capacity for improved failure-recovery performance. We show that our instantiations of parity declustering improve the failure-mode performance of an array significantly, and that a parity-declustered architecture is preferable to an equivalent-size multiple-group RAID Level 5 organization in environments where failure-recovery performance is important. The presented analyses also include comparisons to a RAID Level 1 (mirrored disks) approach. \par With respect to reconstruction algorithms, this paper describes and briefly evaluates two alternatives stripe-oriented reconstruction and disk-oriented reconstruction, and establishes that the latter is preferable as it provides faster reconstruction. The paper then revisits a set of previously-proposed reconstruction optimizations, evaluating their efficacy when used in conjunction with the disk-oriented algorithm. The paper concludes with a section on the reliability versus capacity trade-off that must be addressed when designing large arrays.} } @InProceedings{holland:recovery, author = {Mark Holland and Garth A. Gibson and Daniel P. Siewiorek}, title = {Fast, On-Line Failure Recovery in Redundant Disk Arrays}, booktitle = {Proceedings of the 23rd Annual International Symposium on Fault-Tolerant Computing}, year = {1993}, pages = {421--433}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/Declustering/FTCS.ps}, keywords = {parallel I/O, disk array, RAID, redundancy, reliability, pario-bib}, abstract = {This paper describes and evaluates two algorithms for performing on-line failure recovery (data reconstruction) in redundant disk arrays. It presents an implementation of disk-oriented reconstruction, a data recovery algorithm that allows the reconstruction process to absorb essentially all the disk bandwidth not consumed by the user processes, and then compares this algorithm to a previously proposed parallel stripe-oriented approach. The disk-oriented approach yields better overall failure-recovery performance. \par The paper evaluates performance via detailed simulation on two different disk array architectures: the RAID level 5 organization, and the declustered parity organization. The benefits of the disk-oriented algorithm can be achieved using controller or host buffer memory no larger than the size of three disk tracks per disk in the array. This paper also investigates the tradeoffs involved in selecting the size of the disk accesses used by the failure recovery process.} } @PhdThesis{holland:thesis, author = {Mark Holland}, title = {On-Line Data Reconstruction in Redundant Disk Arrays}, year = {1994}, month = {April}, school = {Carnegie Mellon University}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/Declustering/Thesis.ps}, keywords = {parallel I/O, disk arrays, RAID, redundancy, reliability, pario-bib}, abstract = {There exists a wide variety of applications in which data availability must be continuous, that is, where the system is never taken off-line and any interruption in the accessibility of stored data causes significant disruption in the service provided by the application. Examples include on-line transaction processing systems such as airline reservation systems, and automated teller networks in banking systems. 
In addition, there exist many applications for which a high degree of data availability is important, but continuous operation is not required. An example is a research and development environment, where access to a centrally-stored CAD system is often necessary to make progress on a design project. These applications and many others mandate both high performance and high availability from their storage subsystems. \par Parity-based redundant disk arrays are very attractive storage alternatives for these systems because they offer both low cost per megabyte and high data reliability. Unfortunately such systems exhibit poor availability characteristics; their performance is severely degraded in the presence of a disk failure. This dissertation addresses the design of parity-based redundant disk arrays that offer dramatically higher levels of performance in the presence of failure than systems comprising the current state of the art. \par We consider two primary aspects of the failure-recovery problem: the organization of the data and redundancy in the array, and the algorithm used to recover the lost data. We apply results from combinatorial theory to generate data and parity organizations that minimize performance degradation during failure recovery by evenly distributing all failure-induced workload over a larger-than-minimal collection of disks. We develop a reconstruction algorithm that is able to absorb for failure-recovery essentially all of the array's bandwidth that is not absorbed by the application process(es). Additionally, we develop a design for a redundant disk array targeted at extremely high availability through extremely fast failure recovery. This development also demonstrates the generality of the presented techniques.}, comment = {Garth Gibson, advisor.} } @InProceedings{hou:disk, author = {Robert Y. Hou and Gregory R. Ganger and Yale N. Patt and Charles E. Gimarc}, title = {Issues and Problems in the {I/O} Subsystem, Part {I} --- {The} Magnetic Disk}, booktitle = {Proceedings of the Twenty-Fifth Annual Hawaii International Conference on System Sciences}, year = {1992}, pages = {48--57}, keywords = {parallel I/O, pario-bib}, comment = {A short summary of disk I/O issues: disk technology, latency reduction, parallel I/O, etc..} } @InProceedings{hsiao:decluster, author = {Hui-I Hsiao and David DeWitt}, title = {{Chained Declustering}: {A} New Availability Strategy for Multiprocessor Database Machines}, booktitle = {Proceedings of 6th International Data Engineering Conference}, year = {1990}, pages = {456--465}, keywords = {disk array, reliability, parallel I/O, pario-bib}, comment = {Chained declustering has cost like mirroring, since it replicates each block, but has better load increase during failure than mirrors, interleaved declustering, or RAID. (Or parity striping (my guess)). Has reliability between that of mirrors and RAID, and much better than interleaved declustering. Would also be much easier in a distributed environment. 
See hsiao:diskrep.} } @InProceedings{hsiao:diskrep, author = {Hui-I Hsiao and David DeWitt}, title = {A Performance Study of Three High Availability Data Replication Strategies}, booktitle = {Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year = {1991}, month = {December}, pages = {18--28}, later = {hsiao:diskrep2}, keywords = {disk array, reliability, disk mirroring, parallel I/O, pario-bib}, comment = {Compares mirrored disks (MD) with interleaved declustering (ID) with chained declustering (CD). ID and CD found to have much better performance in normal and failure modes. See hsiao:decluster.} } @Article{hsiao:diskrep2, author = {Hui-I Hsiao and David DeWitt}, title = {A Performance Study of Three High Availability Data Replication Strategies}, journal = {Journal of Distributed and Parallel Databases}, year = {1993}, month = {January}, volume = {1}, number = {1}, pages = {53--79}, earlier = {hsiao:diskrep}, keywords = {disk array, reliability, disk mirroring, parallel I/O, pario-bib}, comment = {See hsiao:diskrep.} } @Article{hsieh:vod, author = {Jenwei Hsieh and Mengjou Lin and Thomas M. Ruwart}, title = {Performance of a Mass Storage System for Video-on-Demand}, journal = {Journal of Parallel and Distributed Computing}, year = {1995}, month = {November}, volume = {30}, number = {2}, pages = {147--167}, publisher = {Academic Press}, keywords = {multimedia server, video on demand, pario-bib} } @InCollection{hu:brapid-cache, author = {Yiming Hu and Qing Yang and Tycho Nightingale}, title = {{RAPID-Cache}--- A Reliable and Inexpensive Write Cache for Disk {I/O} Systems}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {15}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {211--223}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {hu:rapid-cache}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, disk cache, disk striping, disk array, pario-bib}, comment = {Part of jin:io-book; reformatted version of hu:rapid-cache.} } @InCollection{hu:rapid-cache, author = {Yiming Hu and Qing Yang and Tycho Nightingale}, title = {{RAPID-Cache}--- A Reliable and Inexpensive Write Cache for Disk {I/O} Systems}, booktitle = {Proceedings of the 5th International Symposium on High Performance Computer Architecture}, year = {2001}, month = {January}, pages = {204--213}, publisher = {IEEE Computer Society Press}, later = {hu:brapid-cache}, URL = {http://computer.org/proceedings/hpca/0004/00040204abs.htm}, keywords = {parallel I/O, disk cache, disk striping, disk array, pario-bib}, abstract = {This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two level hierarchy: a small NVRAM buffer on top of a log disk. The backup cache has nearly equivalent write performance as the primary RAM cache, while the read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods.
The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetric cache architecture allows cost-effective designs for very large write caches for high-end disk I/O systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability and cost.} } @InProceedings{hua:annealing, author = {Kien A. Hua and S. D. Lang and Wen K. Lee}, title = {A decomposition-based simulated annealing technique for data clustering}, booktitle = {Proceedings of the Thirteenth ACM Symposium on Principles of Database Systems}, year = {1994}, pages = {117--128}, publisher = {ACM Press}, URL = {http://www.acm.org/pubs/citations/proceedings/pods/182591/p117-hua}, keywords = {out of core, information retrieval, parallel I/O, pario-bib}, abstract = {It has been demonstrated that simulated annealing provides high-quality results for the data clustering problem. However, existing simulated annealing schemes are memory-based algorithms; they are not suited for solving large problems such as data clustering which typically are too big to fit in the memory space in its entirety. Various buffer replacement policies, assuming either temporal or spatial locality, are not useful in this case since simulated annealing is based on a randomized search process. Poor locality of references will cause the memory to thrash because too many replacements are required. This phenomenon will incur excessive disk accesses and force the machine to run at the speed of the I/O subsystem. In this paper, we formulate the data clustering problem as a graph partition problem (GPP), and propose a decomposition-based approach to address the issue of excessive disk accesses during annealing. We apply the statistical sampling technique to randomly select subgraphs of the GPP into memory for annealing. Both the analytical and experimental studies indicate that the decomposition-based approach can dramatically reduce the costly disk I/O activities while obtaining excellent optimized results.} } @InCollection{huber:bppfs, author = {James V. {Huber, Jr.} and Christopher L. Elford and Daniel A. Reed and Andrew A. Chien and David S. Blumenthal}, title = {{PPFS}: A High Performance Portable Parallel File System}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {22}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {330--343}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {huber:ppfs}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel file system, parallel I/O, pario-bib}, comment = {Part of jin:io-book, revised version of huber:ppfs.} } @MastersThesis{huber:msthesis, author = {James V. 
{Huber, Jr.}}, title = {{PPFS}: An Experimental File System for High Performance Parallel Input/Output}, year = {1995}, month = {February}, school = {Department of Computer Science, University of Illinois at Urbana Champaign}, URL = {http://www-pablo.cs.uiuc.edu/Papers/huberMS.html}, keywords = {parallel file system, pario-bib}, abstract = {The I/O problem is described in the context of parallel scientific applications. A user-level input/output library, PPFS, is introduced to address these issues. The design and implementation of PPFS are presented. Some simple performance benchmarks are reported. Experiments on two production-scale applications are given.}, comment = {He describes the design and implementation of PPFS, along with some experimental results. PPFS is a C++ library and a set of servers that implement a parallel file system on top of unix on a cluster or a Paragon. Interesting features of PPFS include: files are a sequence of records (fixed size or variable size), read_next and read_any operations, a no-extend option to reduce overhead of maintaining file-size information, client and server caching, intermediate caching agents for consistency, prefetching and write behind, and user-defined declustering and indexing policies. User-defined changes actually have to be precompiled into the server programs. Good results in comparison to PFS on the Paragon, though that doesn't say much. They are porting it to the SP-2.} } @InProceedings{huber:ppfs, author = {Jay Huber and Christopher L. Elford and Daniel A. Reed and Andrew A. Chien and David S. Blumenthal}, title = {{PPFS}: A High Performance Portable Parallel File System}, booktitle = {Proceedings of the 9th ACM International Conference on Supercomputing}, year = {1995}, month = {July}, pages = {385--394}, publisher = {ACM Press}, address = {Barcelona}, earlier = {huber:ppfs-tr}, later = {huber:bppfs}, URL = {http://www-pablo.cs.uiuc.edu/Papers/ICS95-ppfs.html}, keywords = {parallel file system, parallel I/O, pario-bib}, abstract = {Rapid increases in processor performance over the past decade have outstripped performance improvements in input/output devices, increasing the importance of input/output performance to overall system performance. Further, experience has shown that the performance of parallel input/output systems is particularly sensitive to data placement and data management policies, making good choices critical. To explore this vast design space, we have developed a user-level library, the Portable Parallel File System (PPFS), which supports rapid experimentation and exploration. The PPFS includes a rich application interface, allowing the application to advertise access patterns, control caching and prefetching, and even control data placement. PPFS is both extensible and portable, making possible a wide range of experiments on a broad variety of platforms and configurations. 
Our initial experiments, based on simple benchmarks and two application programs, show that tailoring policies to input/output access patterns yields significant performance benefits, often improving performance by nearly an order of magnitude.} } @TechReport{huber:ppfs-scenarios, author = {Jay Huber and Chris Kuszmaul and Tara Madhyastha and Chris Elford}, title = {Scenarios for the Portable Parallel File System}, year = {1993}, month = {November}, institution = {University of Illinois at Urbana-Champaign}, URL = {http://www-pablo.cs.uiuc.edu/Papers/PPFS-scenario.ps.Z}, keywords = {parallel file system, parallel I/O, pario-bib}, comment = {See also elford:ppfs-tr, huber:ppfs.} } @TechReport{huber:ppfs-tr, author = {Jay Huber and Christopher L. Elford and Daniel A. Reed and Andrew A. Chien and David S. Blumenthal}, title = {{PPFS}: A High Performance Portable Parallel File System}, year = {1995}, month = {January}, number = {UIUCDCS-R-95-1903}, institution = {University of Illinois at Urbana-Champaign}, later = {huber:ppfs}, URL = {http://www-pablo.cs.uiuc.edu/Papers/PPFS-TR.html}, keywords = {parallel file system, pario-bib}, abstract = {Rapid increases in processor performance over the past decade have outstripped performance improvements in input/output devices, increasing the importance of input/output performance to overall system performance. Further, experience has shown that the performance of parallel input/output systems is particularly sensitive to data placement and data management policies, making good choices critical. To explore this vast design space, we have developed a user-level library, the Portable Parallel File System (PPFS), which supports rapid experimentation and exploration. The PPFS includes a rich application interface, allowing the application to advertise access patterns, control caching and prefetching, and even control data placement. PPFS is both extensible and portable, making possible a wide range of experiments on a broad variety of platforms and configurations. Our initial experiments, based on simple benchmarks and two application programs, show that tailoring policies to input/output access patterns yields significant performance benefits, often improving performance by nearly an order of magnitude.}, comment = {They have built a user-level library that implements a parallel file system on top of a set of vanilla Unix file systems. Their goals include flexibility and portability, so they can use PPFS to explore issues in parallel I/O. They allow the application to have lots of control over data distribution, cache and prefetch policies, etc. They support fixed- and variable-length records. They support client, server, and shared caches. This TR includes syntax and specs for all functions. They include performance for synthetic benchmarks and application codes, compared with Intel Paragon PFS (which is admittedly not a very tough competitor).} } @MastersThesis{hubovskykunz:msthesis, author = {Rainer Hubovsky and Florian Kunz}, title = {Dealing with Massive Data: from Parallel I/O to Grid I/O}, year = {2004}, month = {January}, school = {Vienna University of Technology}, address = {Vienna, Austria}, URL = {http://www.cs.dartmouth.edu/pario/hubovsky_dictionary.pdf}, keywords = {parallel i/o, cluster i/o, grid i/o, distributed computing, pario-bib}, abstract = {Increasing requirements in HPC led to improvements of CPU power, but bandwidth of I/O subsystems does not keep up with the performance of processors any more. This problem is commonly known as the I/O bottleneck.
Additionally, new and stimulating data-intensive problems in biology, physics, astronomy, space exploration, and human genome research arise, which bring new high-performance applications dealing with massive data spread over globally distributed storage resources. Therefore research in HPC focuses more on I/O systems: all leading hardware vendors of multiprocessor systems provided powerful concurrent I/O subsystems. In accordance researchers focus on the design of appropriate programming tools and models to take advantage of the available hardware resources. Numerous projects about this topic have appeared, from which a large and unmanageable quantity of publications have come. These publications concern themselves to a large extent with very special problems. Due to the time of their appearance the few overview papers deal with Parallel I/O or Cluster I/O. Substantial progress has been made in these research areas since then. Grid Computing has emerged as an important new field, distinguished from conventional Distributed Computing by its focus on large-scale resource sharing, innovative applications and, in some cases, high-performance orientation. Over the past five years, research and development efforts within the Grid community have produced protocols, services and tools that address precisely the challenges that arise when we try to build Grids, I/O being an important part of it. Therefore our work gives an overview of I/O in HPC.}, comment = {Like stockinger:dictionary, this master's thesis categorizes and describes a large set of parallel I/O-related projects and applications.} } @MastersThesis{husmann:format, author = {Harlan Edward Husmann}, title = {High-Speed Format Conversion and Parallel {I/O} in Numerical Programs}, year = {1984}, month = {January}, school = {Department of Computer Science, Univ. of Illinois at Urbana-Champaign}, note = {Available as TR number UIUCDCS-R-84-1152}, keywords = {parallel I/O, I/O, pario-bib}, comment = {Does FORTRAN format conversion in software in parallel or in hardware, to obtain good speedups for lots of programs. However he found that increasing the I/O bandwidth was the most significant change that could be made in the parallel program.} } @Article{hwang:pvfs-cache, author = {In-Chul Hwang and Hojoong Kim and Hanjo Jung and Dong-Hwan Kim and Hojin Ghim and Seung-Ryoul Maeng and Jung-Wan Cho}, title = {Design and implementation of the cooperative cache for {PVFS}}, journal = {Lecture Notes in Computer Science}, booktitle = {4th International Conference on Computational Science (ICCS 2004); June 6-9, 2004; Krakow, POLAND}, editor = {Bubak, M; VanAlbada, GD; Sloot, PMA; Dongarra, JJ}, year = {2004}, month = {June}, volume = {3036}, pages = {43--50}, institution = {Korea Adv Inst Sci \& Technol, Dept Elect Engn \& Comp Sci, Div Comp Sci, 373-1 Kusung Dong, Taejon 305701, South Korea; Korea Adv Inst Sci \& Technol, Dept Elect Engn \& Comp Sci, Div Comp Sci, Taejon 305701, South Korea}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=3036&spage=43}, keywords = {PVFS, cooperative cache, pario-bib}, abstract = {Recently, there have been many efforts to get high performance in cluster computing with inexpensive PCs connected through high-speed networks. Some of them were to provide high bandwidth and parallelism in file service using a distributed file system.
Other researches for distributed file systems include the cooperative cache that reduces servers' load and improves overall performance. The cooperative cache shares file caches among clients so that a client can request a file to another client, not to the server, through inter-client message passing. In various distributed file systems, PVFS (Parallel Virtual File System) provides high performance with parallel I/O in Linux widely used in cluster computing. However, PVFS doesn't support any file cache facility. This paper describes the design and implementation of the cooperative cache for PVFS (Coopc-PVFS). We show the efficiency of Coopc-PVFS in comparison to original PVFS. As a result, the response time of Coopc-PVFS is shorter than or similar to that of original PVFS.} } @InProceedings{hwang:raid-x, author = {Kai Hwang and Hai Jin and Roy Ho}, title = {{RAID-x}: A New Distributed Disk Array for {I/O}-Centric Cluster Computing}, booktitle = {Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing}, year = {2000}, month = {August}, pages = {279--287}, publisher = {IEEE Computer Society Press}, address = {Pittsburgh, PA}, later = {hwang:braid-x}, URL = {http://www.computer.org/proceedings/hpdc/0783/07830279abs.htm}, keywords = {parallel I/O, disk array, disk striping, RAID, pario-bib}, abstract = {A new RAID-x (redundant array of inexpensive disks at level x) architecture is presented for distributed I/O processing on a serverless cluster of computers. The RAID-x architecture is based on a new concept of orthogonal striping and mirroring (OSM) across all distributed disks in the cluster. The primary advantages of this OSM approach lie in: (1) a significant improvement in parallel I/O bandwidth, (2) hiding disk mirroring overhead in the background, and (3) greatly enhanced scalability and reliability in cluster computing applications. All claimed advantages are substantiated with benchmark performance results on the Trojans cluster built at USC in 1999. Throughout the paper, we discuss the issues of scalable I/O performance, enhanced system reliability, and striped checkpointing on distributed RAID-x in a serverless cluster environment.} } @InProceedings{iannizzotto:avda, author = {G. Iannizzotto and A. Puliafito and S. Riccobene and L. Vita}, title = {{AVDA}: A disk array system for multimedia services}, booktitle = {Proceedings of the 1995 International Conference on High Performance Computing}, year = {1995}, month = {December}, pages = {160--165}, address = {New Delhi, India}, keywords = {disk array, multimedia, parallel I/O, pario-bib}, comment = {Petri-net model of disk array using Information-Dispersal Algorithm (IDA) to stripe data. Continuous-media workload.} } @Misc{ibm:sp1, key = {IBM}, title = {{IBM 9076 Scalable POWERparallel 1}: General Information}, year = {1993}, month = {February}, howpublished = {IBM brochure GH26-7219-00}, keywords = {multiprocessor architecture, parallel I/O, pario-bib}, comment = {See also information about Vesta file system, corbett:vesta.} } @Booklet{intel:examples, key = {Intel}, title = {Concurrent {I/O} Application Examples}, year = {1989}, howpublished = {Intel Corporation Background Information}, keywords = {file access pattern, parallel I/O, Intel iPSC/2, hypercube, pario-bib}, comment = {Lists several examples and the amount and types of data they require, and how much bandwidth. 
Fluid flow modeling, Molecular modeling, Seismic processing, and Tactical and strategic systems.} } @Booklet{intel:ipsc2io, key = {Intel}, title = {{iPSC/2} {I/O} Facilities}, year = {1988}, howpublished = {Intel Corporation}, note = {Order number 280120-001}, keywords = {parallel I/O, hypercube, Intel iPSC/2, pario-bib}, comment = {Simple overview, not much detail. See intel:ipsc2, pierce:pario, asbury:fortranio. Separate I/O nodes from compute nodes. Each I/O node has a SCSI bus to the disks, and communicates with other nodes in the system via Direct-Connect hypercube routing.} } @Booklet{intel:paragon, key = {Intel}, title = {Paragon {XP/S} Product Overview}, year = {1991}, howpublished = {Intel Corporation}, keywords = {parallel architecture, parallel I/O, Intel, pario-bib}, comment = {Not a bad glossy. See also esser:paragon.} } @Misc{intelio, key = {Intel}, title = {Intel beefs up its {iPSC/2} supercomputer's {I/O} and memory capabilities}, year = {1988}, month = {November}, volume = {61}, number = {11}, pages = {24}, howpublished = {Electronics}, keywords = {parallel I/O, hypercube, Intel iPSC/2, pario-bib} } @Book{iopads-book, title = {Input/Output in Parallel and Distributed Computer Systems}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, editor = {Ravi Jain and John Werth and James C. Browne}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, publisher = {Kluwer Academic Publishers}, keywords = {parallel I/O, parallel I/O architecture, parallel I/O algorithm, multiprocessor file system, workload characterization, parallel file access pattern, pario-bib}, comment = {A book containing papers from IOPADS '94 and IOPADS '95, plus several survey/tutorial papers. See the bib entries with cross-ref to iopads-book.} } @Proceedings{ipps-io93, title = {Proceedings of the IPPS~'93 Workshop on Input/Output in Parallel Computer Systems}, editor = {Ravi Jain and John Werth and J. C. Browne}, year = {1993}, month = {April}, address = {Newport Beach, CA}, note = {Some papers also published in Computer Architecture News 21(5), December 1993}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {The entire proceedings is about parallel I/O.} } @Proceedings{ipps-io94, title = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, editor = {Ravi Jain and John Werth and J. C. Browne}, year = {1994}, month = {April}, note = {Some papers also published in Computer Architecture News 22(4), September 1994}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {The entire proceedings is about parallel I/O.} } @InBook{isaila:arch, author = {Florin Isaila}, title = {An overview of file system architectures.}, booktitle = {Algorithms for Memory Hierarchies. Advanced Lectures}, chapter = {13}, editor = {Meyer, U.; Sanders, P.; Sibeyn, J.}, year = {2003}, month = {March}, series = {Lecture Notes in Computer Science}, pages = {273--289}, publisher = {Springer-Verlag}, copyright = {(c)2004 IEE}, address = {Dagstuhl, Germany}, URL = {http://www.ipd.uka.de/~florin/Publications/dagstuhl.pdf}, keywords = {survey, file system architecture, pario-bib}, abstract = {We provide an overview of different file system architectures. We show the influence of I/O access pattern studies results on file system design. We present techniques, algorithms and data structures used in file system implementations. 
We overview issues related to both local and distributed file systems. We describe distributed file system architectures for different kinds of network connectivity: tightly-connected networks (clusters and supercomputers), loosely-connected networks (computational grids) or disconnected computers (mobile computing). File system architectures for both network-attached and computer-attached storage are reviewed. We show how the parallel file systems address the requirements of I/O bound parallel applications. Different file sharing semantics in distributed and parallel file systems are explored. We also present how efficient metadata management can be realized in journaled file systems.} } @Article{isaila:clusterfile, author = {Florin Isaila and Walter F. Tichy}, title = {Clusterfile: a flexible physical layout parallel file system}, journal = {Concurrency and Computation}, year = {2003}, volume = {15}, number = {7/8}, pages = {653--679}, publisher = {Wiley}, URL = {http://www3.interscience.wiley.com/cgi-bin/abstract/104524121/ABSTRACT}, URLpdf = {http://www3.interscience.wiley.com/cgi-bin/fulltext/104524121/PDFSTART}, keywords = {parallel file system, parallel I/O, pario-bib}, abstract = {This paper presents Clusterfile, a parallel file system that provides parallel file access on a cluster of computers. We introduce a file partitioning model that has been used in the design of Clusterfile. The model uses a data representation that is optimized for multidimensional array partitioning while allowing arbitrary partitions. The paper shows how the file model can be employed for file partitioning into both physical subfiles and logical views. We also present how the conversion between two partitions of the same file is implemented using a general memory redistribution algorithm. We show how we use the algorithm to optimize non-contiguous read and write operations. The experimental results include performance comparisons with the Parallel Virtual File System (PVFS) and an MPI-IO implementation for PVFS.} } @InProceedings{isaila:integrating, author = {Florin Isaila and Guido Malpohl and Vlad Olaru and Gabor Szeder and Walter Tichy}, title = {Integrating collective {I/O} and cooperative caching into the "clusterfile" parallel file system}, booktitle = {Proceedings of the 18th Annual International Conference on Supercomputing}, year = {2004}, month = {July}, pages = {58--67}, publisher = {ACM Press}, copyright = {(c)2004 Elsevier Engineering Information, Inc.}, address = {Saint-Malo, France}, URL = {http://doi.acm.org/10.1145/1006209.1006219}, keywords = {disk-directed I/O, two-phase I/O, clusterfile parallel file system, cooperative cache, pario-bib}, abstract = {This paper presents the integration of two collective I/O techniques into the Clusterfile parallel file system: disk-directed I/O and two-phase I/O. We show that global cooperative cache management improves the collective I/O performance. The solution focuses on integrating disk parallelism with other types of parallelism: memory (by buffering and caching on several nodes), network (by parallel I/O scheduling strategies) and processors (by redistributing the I/O related computation over several nodes). The performance results show considerable throughput increases over ROMIO's extended two-phase I/O.} } @InProceedings{isaila:viewio, author = {Florin Isaila and Walter F.
Tichy}, title = {View I/O: improving the performance of non-contiguous I/O.}, booktitle = {IEEE International Conference on Cluster Computing}, year = {2003}, month = {December}, pages = {336--343}, publisher = {IEEE Computer Society Press}, address = {Hong Kong, China}, URL = {http://www.ipd.uka.de/~florin/Publications/mypaper.pdf}, keywords = {non-contiguous I/O, parallel file structure, pario-bib}, abstract = {This paper presents view I/O, a non-contiguous parallel I/O technique. We show that the linear file model may be an unsuitable abstraction for non-contiguous I/O optimizations. Additionally, the poor cooperation between a file system and an I/O library like MPI-IO may drastically affect the performance. View I/O has detailed knowledge about parallel structure of a file and about the potential access pattern and exploits it in order to improve performance. The access overhead is reduced by using a strategy "declare once, use several times" and by file offset compaction. We compare and contrast view I/O with other non-contiguous I/O methods. Our measurements on a cluster of computers indicate a significant performance improvement over other approaches.} } @InProceedings{itoh:pimos, author = {Fumihide Itoh and Takashi Chikayama and Takeshi Mori and Masaki Sato and Tatsuo Kato and Tadashi Sato}, title = {The Design of the {PIMOS} File System}, booktitle = {Proceedings of the International Conference on Fifth Generation Computer Systems}, year = {1992}, volume = {1}, pages = {278--285}, organization = {ICOT}, keywords = {parallel file system, pario-bib}, comment = {File system in the PIMOS operating system for the PIM (Parallel Inference Machine) in the Fifth Generation Computer Systems project in Japan. Paper design, no results yet. Uses disks that are attached directly to the computational processors. Significant in that it does use client caches in a parallel file system. Caches are kept coherent with a centralized directory-based protocol for exclusive-writer, multiple-reader semantics, supporting sequential consistency. Disk management includes logging to survive crashes. Bitmap free list with buddy system to support full, 1/2, and 1/4 blocks. Trick to avoid constant update of on-disk free list. My suspicion is that cache coherence protocol may be expensive, especially in larger systems.} } @Article{jadav:evaluation, author = {Divyesh Jadav and Chutimet Srinilta and Alok Choudhary and P. Bruce Berra}, title = {An Evaluation of Design Tradeoffs in a High Performance Media-on-Demand Server}, journal = {Multimedia Systems}, year = {1997}, month = {January}, volume = {5}, number = {1}, pages = {53--68}, URL = {http://www.cat.syr.edu/~divyesh/MMACM.ps}, keywords = {parallel I/O, I/O scheduling, multimedia, video on demand, pario-bib}, abstract = {One of the key components of a multi-user multimedia-on-demand system is the data server. Digitalization of traditionally analog data such as video and audio, and the feasibility of obtaining network bandwidths above the gigabit-per-second range are two important advances that have made possible the realization, in the near future, of interactive distributed multimedia systems. Secondary-to-main memory I/O technology has not kept pace with advances in networking, main memory and CPU processing power. Consequently, the performance of the server has a direct bearing on the overall performance of such a system. \par In this paper we present a high-performance solution to the I/O retrieval problem in a distributed multimedia system.
Parallelism of data retrieval is achieved by striping the data across multiple disks. We identify the different components that contribute to media data retrieval delay. The variable delays among these have a great bearing on the server throughput under varying load conditions. We present a buffering scheme to minimize these variations. We have implemented our model on the Intel Paragon parallel computer. The results of component-wise instrumentation of the server operation are presented and analyzed. We present experimental results that demonstrate the efficacy of the buffering scheme. Based on our experiments, a dynamic admission control policy that takes server workload into account is proposed.}, comment = {Much more detailed than jadav:media-on-demand. Here they present less survey information, and all the details on their Paragon implementation/simulation. They experiment with many tradeoffs, and propose and evaluate several scheduling and admission-control algorithms.} } @InProceedings{jadav:ioschedule, author = {Divyesh Jadav and Chutimet Srinilta and Alok Choudhary and P. Bruce Berra}, title = {Design and Evaluation of Data Access Strategies in a High Performance Multimedia-on-Demand Server}, booktitle = {Proceedings of the Second IEEE International Conference on Multimedia Computing and Systems}, year = {1995}, month = {May}, pages = {286--291}, later = {jadav:j-ioschedule}, URL = {http://www.cat.syr.edu/~divyesh/ICMCS95.ps}, keywords = {parallel I/O, multimedia, pario-bib}, abstract = {One of the key components of a multi user multimedia on demand system is the data server. Digitization of traditionally analog data such as video and audio, and the feasibility of obtaining network bandwidths above the gigabit per second range are two important advances that have made possible the realization, in the near future, of interactive distributed multimedia systems. Secondary-to-main memory I/O technology has not kept pace with advances in networking, main memory and CPU processing power. Consequently, the performance of the server has a direct bearing on the overall performance of such a system. We develop a model for the architecture of a server for such a system. Parallelism of data retrieval is achieved by striping the data across multiple disks. The performance of any server ultimately depends on the data access patterns. Two modifications of the basic retrieval algorithm are presented to exploit data access patterns in order to improve system throughput and response time. A complementary information caching optimization is discussed. Finally, we present performance results of these algorithms on the IBM SP1 and Intel Paragon parallel computers.}, comment = {Journal version is jadav:j-ioschedule? See also jadav:media-on-demand. [Comments based on a much earlier version.] They propose I/O scheduling algorithms for multimedia file servers. They assume an MIMD architecture with no shared memory and with a disk on every node. One node is essentially a manager for new requests. Another set are interface nodes, each managing the data flow for a few multimedia data streams. The majority are server nodes, responsible just for fetching their data from disk and sending it to the interface nodes. The interface nodes assemble data from the server nodes into a data stream, and send it on out to the client. They describe algorithms for scheduling requests from the interface node to the server node, and for sending data out to the client. 
They also describe an algorithm for determining whether the system can accept a new request.} } @InProceedings{jadav:ioschedule2, author = {Divyesh Jadav and Chutimet Srinilta and Alok Choudhary}, title = {{I/O} scheduling tradeoffs in a high performance media-on-demand server}, booktitle = {Proceedings of the 1995 International Conference on High Performance Computing}, year = {1995}, month = {December}, pages = {154--159}, address = {New Delhi, India}, later = {jadav:j-ioschedule}, keywords = {multimedia, scheduling, parallel I/O, pario-bib}, comment = {See also jadav:ioschedule, jadav:j-ioschedule.} } @Article{jadav:j-ioschedule, author = {Divyesh Jadav and Chutimet Srinilta and Alok Choudhary and P. Bruce Berra}, title = {Techniques for Scheduling {I/O} in a High Performance Multimedia-On-Demand Server}, journal = {Journal of Parallel and Distributed Computing}, year = {1996}, month = {November}, pages = {190--203}, publisher = {Academic Press}, earlier = {jadav:ioschedule}, URL = {http://www.cat.syr.edu/~divyesh/JPDC.ps}, keywords = {parallel I/O, I/O scheduling, multimedia, video on demand, pario-bib}, comment = {Conference version is jadav:ioschedule; similar abstract. See jadav:media-on-demand.} } @Article{jadav:media, author = {D. Jadav and C. Srinilta and A. Choudhary}, title = {Batching and dynamic allocation techniques for increasing the stream capacity of an on-demand media server}, journal = {Parallel Computing}, year = {1997}, month = {December}, volume = {23}, number = {12}, pages = {1727--1742}, publisher = {Elsevier Science}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00084-7}, keywords = {multimedia, parallel I/O, pario-bib}, abstract = {A server for an interactive distributed multimedia system may require thousands of gigabytes of storage space and high I/O bandwidth. In order to maximize system utilization, and thus minimize cost, the load must be balanced among the server's disks, interconnection network and scheduler. Many algorithms for maximizing retrieval capacity from the storage system have been proposed. This paper presents techniques for improving server capacity by assigning media requests to the nodes of a server so as to balance the load on the interconnection network and the scheduling nodes. Five policies for dynamic request assignment are developed. An important factor that affects data retrieval in a high-performance continuous media server is the degree of parallelism of data retrieval. The performance of the dynamic policies on an implementation of a server model developed earlier is presented for two values of the degree of parallelism.} } @Article{jadav:media-on-demand, author = {Divyesh Jadav and Alok Choudhary}, title = {Designing and Implementing High Performance Media-on-Demand Servers}, journal = {IEEE Parallel and Distributed Technology}, year = {1995}, month = {Summer}, volume = {3}, number = {2}, pages = {29--39}, publisher = {IEEE Computer Society Press}, URL = {http://www.cat.syr.edu/~divyesh/PDT.ps}, keywords = {parallel I/O, multimedia, video on demand, pario-bib}, abstract = {This paper discusses the architectural requirements of a multimedia-on-demand system, with special emphasis on the media server. Although high-performance computers are the best choice for building media-on-demand servers, implementation poses many difficulties. 
We conclude with a discussion of the open issues regarding the design and implementation of the server.}, comment = {A survey of the issues involved in designing a media-on-demand server (they really focus on temporal data like video and audio). They do have a few results comparing various granularities for disk-requests and network messages, which seem to be from an Intel Paragon implementation, although they do not describe the experimental setup. See jadav:evaluation, jadav:j-ioschedule, jadav:ioschedule.} } @Article{jain:airdisks, author = {Ravi Jain and John Werth}, title = {Airdisks and AirRAID: Modeling and scheduling periodic wireless data}, journal = {Computer Architecture News}, year = {1995}, month = {September}, volume = {23}, number = {4}, pages = {23--28}, keywords = {wireless communication, mobile computing, RAID, parallel I/O, pario-bib}, comment = {They discuss the idea of broadcasting a disk's data over the air, so PDAs can 'read' the disk by waiting for the necessary data to come along. Good for read-only or write-rarely disks. They discuss the idea of dividing the air into multiple (frequency or time) tracks and 'striping' data across the tracks for better bandwidth and reliability.} } @Article{jain:jschedule, author = {Ravi Jain and Kiran Somalwar and John Werth and J.~C. Browne}, title = {Heuristics for Scheduling {I/O} Operations}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1997}, month = {March}, volume = {8}, number = {3}, pages = {310--320}, URL = {http://www.computer.org/tpds/td1997/l0310abs.htm}, keywords = {network, graph coloring, multiprocessor file system, resource allocation, scheduling, parallel I/O, pario-bib}, abstract = {The I/O bottleneck in parallel computer systems has recently begun receiving increasing interest. Most attention has focused on improving the performance of I/O devices using fairly low-level parallelism in techniques such as disk striping and interleaving. Widely applicable solutions, however, will require an integrated approach which addresses the problem at multiple system levels, including applications, systems software, and architecture. We propose that within the context of such an integrated approach, scheduling parallel I/O operations will become increasingly attractive and can potentially provide substantial performance benefits. \par We describe a simple I/O scheduling problem and present approximate algorithms for its solution. The costs of using these algorithms in terms of execution time, and the benefits in terms of reduced time to complete a batch of I/O operations, are compared with the situations in which no scheduling is used, and in which an optimal scheduling algorithm is used. The comparison is performed both theoretically and experimentally. We have found that, in exchange for a small execution time overhead, the approximate scheduling algorithms can provide substantial improvements in I/O completion times.}, comment = {See also jain:pario.} } @Article{jain:pario, author = {Ravi Jain and Kiran Somalwar and John Werth and J. C.
Browne}, title = {Scheduling Parallel {I/O} Operations in Multiple Bus Systems}, journal = {Journal of Parallel and Distributed Computing}, year = {1992}, month = {December}, volume = {16}, number = {4}, pages = {353--362}, publisher = {Academic Press}, keywords = {parallel I/O, shared memory, scheduling, pario-bib}, comment = {An algorithm to schedule (off-line) a set of transfers between P procs and D disks, such that no proc or disk does more than one request at a time, and no more than K transfers are concurrent (due to channel limits), with integer arbitrary-length transfers that are preemptable (ie segmentable). Much faster than previous algorithms. Problems, IMHO: off-line is only good for batch executions with known needs (ok for big collective I/Os I suppose). All K channels are usable by all proc-disk pairs, may not be realistic. No accommodation for big difference in disk and channel time, ie, disk probably can't do a channel transfer every time unit. Allows transfers in any order, which means disk seeks could be bad. No cost for preemption of a transfer, which could mean more message overhead if more messages are needed to do a given transfer. Assumes all transfers have predictable time. Still, it could be useful in some situations, esp. where order really doesn't matter.} } @InCollection{jain:pario-intro, author = {Ravi Jain and John Werth and J.~C. Browne}, title = {{I/O} in Parallel and Distributed Systems: An Introduction}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {1}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {3--30}, publisher = {Kluwer Academic Publishers}, keywords = {parallel I/O, out-of-core, pario-bib}, abstract = {We sketch the reasons for the I/O bottleneck in parallel and distributed systems, pointing out that it can be viewed as a special case of a general bottleneck that arises at all levels of the memory hierarchy. We argue that because of its severity, the I/O bottleneck deserves systematic attention at all levels of system design. We then present a survey of the issues raised by the I/O bottleneck in five key areas of parallel and distributed systems: applications, algorithms, compilers, operating systems and architecture. Finally, we address some of the trends we observe emerging in new paradigms of parallel and distributed computing: the convergence of networking and I/O, I/O for massively distributed ``global information systems'' such as the World Wide Web, and I/O for mobile computing and wireless communications. These considerations suggest exciting new research directions in I/O for parallel and distributed systems in the years to come.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @PhdThesis{jensen:thesis, author = {David Wayne Jensen}, title = {Disk {I/O} In High-Performance Computing Systems}, year = {1993}, school = {Univ. Illinois, Urbana-Champaign}, keywords = {parallel I/O, file access pattern, multiprocessor file system, pario-bib}, comment = {He looks at the effect of I/O traffic on memory access in a multistage network, and custom mappings of file data to disks to support non-sequential I/O. He considers both the traditional ``multiuser'' workload and the case where an application accesses a single file in parallel. Assumes a dance-hall shared-memory MIMD base architecture (CEDAR).
Disks are attached either to the memory or processor side of the network, and in either case require four network traversals per read/write operation. Nice summary of previous parallel I/O architectures, and characterization of the workload. Main conclusions: the network is not an inherent bottleneck, but I/O traffic can cause up to 50\% loss in memory traffic bandwidth, and bursts of I/O can saturate the network. For a high I/O request rate (eg, all procs active), spread each request over a small number of disks (eg, one), whereas for a low I/O request rate (eg, one proc active) spread each request over lots of disks (eg, all). This avoids cache thrashing when multiple procs hit on one disk node. However, if they are all reading the same data, then there is no cache thrashing and you want to maximize parallelism across disks. When accessing disjoint parts of the same file, it is sometimes better to have one proc do all the accesses, because this avoids out-of-order requests that spoil prefetching, and it avoids contention by multiple procs. No single file-to-disk mapping worked for everything; interleaved (striped) worked well for most sequential patterns, but ``sequential'' (partitioned) mappings worked better for multiple-process loads that tend to focus each process on a disk, eg, an interleaved pattern where the stride is equal to the number of disks. Thus, if your pattern can get you disk locality, use a mapping that will provide it.} } @Article{jeong:inverted, author = {Byeong-Soo Jeong and Edward Omiecinski}, title = {Inverted File Partitioning Schemes in Multiple Disk Systems}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1995}, month = {April}, volume = {6}, number = {2}, pages = {142--153}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, pario-bib}, comment = {Ways to distribute data across multiple disks to speed information retrieval, given an inverted index. Based on a shared-everything multiprocessor model.} } @Book{jin:io-book, title = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, URL = {http://www.buyya.com/superstorage/}, keywords = {RAID, disk array, parallel file system, caching, prefetching, multiprocessor file system, parallel I/O applications, parallel I/O, pario-bib}, comment = {An excellent collection of papers that were mostly published earlier.} } @Article{jin:striping, author = {H. Jin and K. Hwang}, title = {Optimal striping in {RAID} architecture}, journal = {Concurrency--- Practice and Experience}, year = {2000}, month = {August}, volume = {12}, number = {10}, pages = {909--916}, URL = {http://www3.interscience.wiley.com/cgi-bin/fulltext?ID=73000569&PLACEBO=IE.pdf}, keywords = {parallel I/O, RAID, disk striping, pario-bib}, abstract = {To access a RAID (redundant arrays of inexpensive disks), the disk stripe size greatly affects the performance of the disk array. In this article, we present a performance model to analyze the effects of striping with different stripe sizes in a RAID. The model can be applied to optimize the stripe size. Compared with previous approaches, our model is simpler to apply and more accurately reveals the real performance. 
Both system designers and users can apply the model to support parallel I/O events} } @InProceedings{johnson:insertions, author = {Theodore Johnson}, title = {Supporting Insertions and Deletions in Striped Parallel Filesystems}, booktitle = {Proceedings of the Seventh International Parallel Processing Symposium}, year = {1993}, pages = {425--433}, publisher = {IEEE Computer Society Press}, address = {Newport Beach, CA}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {If you insert blocks into a striped file, you mess up the nice striping. So he breaks the file into striped extents, and keeps track of the extents with a distributed B-tree index. Deletions also fit into the same scheme.} } @InProceedings{johnson:scx, author = {Steve Johnson and Steve Scott}, title = {A Supercomputer System Interconnect and Scalable {IOS}}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {357--367}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/conferen/mss95/johnson/johnson.htm}, keywords = {mass storage, I/O architecture, I/O interconnect, supercomputer, parallel I/O, pario-bib}, abstract = {The evolution of system architectures and system configurations has created the need for a new supercomputer system interconnect. Attributes required of the new interconnect include commonality among system and subsystem types, scalability, low latency, high bandwidth, a high level of resiliency, and flexibility. Cray Research Inc. is developing a new system channel to meet these interconnect requirements in future systems. The channel has a ring-based architecture, but can also function as a point-to-point link. It integrates control and data on a single, physical path while providing low latency and variance for control messages. Extensive features for client isolation, diagnostic capabilities, and fault tolerance have been incorporated into the design. The attributes and features of this channel are discussed along with implementation and protocol specifics.}, comment = {About the Cray Research SCX channel, capable of 1200 MB/s peak and 900 MB/s delivered throughput.} } @Article{johnson:wave, author = {Olin G. Johnson}, title = {Three-dimensional Wave Equation Computations on Vector Computers}, journal = {Proceedings of the IEEE}, year = {1984}, month = {January}, volume = {72}, number = {1}, pages = {90--95}, keywords = {computational physics, parallel I/O, pario-bib}, comment = {Old paper on the need for large memory and fast paging and I/O in out-of-core solutions to 3-d seismic modeling. They used 4-way parallel I/O to support their job. Needed to transfer a 3-d matrix in and out of memory by rows, columns, and vertical columns. Stored in block-structured form to improve locality on the disk.} } @InProceedings{jones:mpi-io, author = {Terry Jones and Richard Mark and Jeanne Martin and John May and Elsie Pierce and Linda Stanberry}, title = {An {MPI-IO} Interface to {HPSS}}, booktitle = {Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1996}, month = {September}, pages = {I:37--50}, keywords = {mass storage, parallel I/O, multiprocessor file system interface, pario-bib} } @Article{jones:skyhi, author = {Philip W. Jones and Christopher L. Kerr and Richard S. 
Hemler}, title = {Practical considerations in development of a parallel {SKYHI} general circulation model}, journal = {Parallel Computing}, year = {1995}, volume = {21}, pages = {1677--1694}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel computing, scientific computing, weather prediction, global climate model, parallel I/O, pario-bib}, comment = {They talk about a weather code. There's a bit about the parallel I/O issues. They periodically write a restart file, and they write out several types of data files. They write out the data in any order, with a little mini-header in each chunk that describes the chunk. I/O was not a significant percentage of their run time on either the CM5 or C90. See hammond:atmosphere and hack:ncar in the same issue.} } @Article{kallahalla:buffer-management, author = {M. Kallahalla and P. J. Varman}, title = {Analysis of simple randomized buffer management for parallel I/O}, journal = {Information Processing Letters}, year = {2004}, month = {April}, volume = {90}, number = {1}, pages = {47--52}, institution = {Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA; Hewlett Packard Labs, Palo Alto, CA 94304 USA; Rice Univ, Dept ECE, Houston, TX 77005 USA}, publisher = {Elsevier Science}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://dx.doi.org/10.1016/j.ipl.2004.01.009}, keywords = {parallel I/O, prefetching, data placement, caching, buffer management, analysis, algorithms, randomization, pario-bib}, abstract = {Buffer management for a D-disk parallel I/O system is considered in the context of randomized placement of data on the disks. A simple prefetching and caching algorithm PHASE-LRU using bounded lookahead is described and analyzed. It is shown that PHASE-LRU performs an expected number of I/Os that is within a factor $\Theta(\log D/\log\log D)$ of the number performed by an optimal off-line algorithm. In contrast, any deterministic buffer management algorithm with the same amount of lookahead must do at least $\Omega(\sqrt{D})$ times the number of I/Os of the optimal. (C) 2004 Elsevier B.V. All rights reserved.} } @Article{kallahalla:pc-opt, author = {M. Kallahalla and P.~J. Varman}, title = {{PC-OPT}: Optimal Offline Prefetching and Caching for Parallel {I/O} Systems}, journal = {IEEE Transactions on Computers}, year = {2002}, month = {November}, volume = {51}, number = {11}, pages = {1333--1344}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/tc/tc2002/tytoc.htm}, keywords = {parallel I/O, file prefetching, pario-bib}, abstract = {We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for parallel disk scheduling. Traditional buffer management algorithms that minimize the number of block misses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. We show that in the offline case, where a priori knowledge of all the requests is available, PC-OPT performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal in the parallel disk model. In the online case, we study the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests.
We show that the competitive ratio of PC-OPT, with global L-block lookahead, is $\Theta(M-L+D)$, when L < M, and $\Theta(MD/L)$, when L > M, where the number of disks is D and buffer size is M.} } @InProceedings{kallahalla:prefetch, author = {Mahesh Kallahalla and Peter J. Varman}, title = {Optimal Prefetching and Caching for Parallel {I/O} Systems}, booktitle = {Proceedings of the Thirteenth Symposium on Parallel Algorithms and Architectures}, year = {2001}, month = {July}, pages = {219--228}, publisher = {ACM Press}, note = {To appear}, URL = {http://www.ece.rice.edu/~pjv/spaa.ps}, keywords = {parallel I/O, prefetch, disk cache, pario-bib}, abstract = {We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for optimal parallel-disk scheduling. Traditional buffer management algorithms that minimize the number of I/O disk accesses, are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. \par We present a new algorithm Super for parallel-disk I/O scheduling. We show that in the off-line case, where a priori knowledge of all the requests is available, Super performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal. In the on-line case, we study Super in the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of Super, with global L-block lookahead, is $\Theta(M-L+D)$, when L < M, and $\Theta(MD/L)$, when L >= M, where the number of disks is D and buffer size is M.} } @InProceedings{kallahalla:read-once, author = {Mahesh Kallahalla and Peter J. Varman}, title = {Optimal Read-Once Parallel Disk Scheduling}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {68--77}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Kallahalla.ps}, keywords = {disk scheduling, parallel I/O, pario-bib}, abstract = {We present an optimal algorithm, L-OPT, for prefetching and I/O scheduling in parallel I/O systems using a read-once model of block reference. The algorithm uses knowledge of the next L block references, L-block lookahead, to schedule I/Os in an on-line manner. It uses a dynamic priority assignment scheme to decide when blocks should be prefetched, so as to minimize the total number of I/Os. The parallel disk model of an I/O system is used to study the performance of L-OPT. We show that L-OPT is comparable to the best on-line algorithm with the same amount of lookahead; the ratio of the length of its schedule to the length of the optimal schedule is within a constant factor of the best possible. Specifically, we show that the competitive ratio of L-OPT is $\Omega(\sqrt{MD/L})$ which matches the lower bound on the competitive ratio of any prefetching algorithm with L-block lookahead. In addition we show that when the lookahead consists of the entire reference string, L-OPT performs the minimum possible number of I/Os; hence L-OPT is the optimal off-line algorithm. Finally, using synthetic traces we empirically study the performance characteristics of L-OPT.} } @InProceedings{kalns:video, author = {Edgar T.
Kalns and Yarsun Hsu}, title = {Video on Demand Using the {Vesta} Parallel File System}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {30--46}, later = {kalns:video-book}, keywords = {parallel I/O, multimedia, multiprocessor file system, pario-bib}, comment = {Hook a video-display system to the compute node of an SP-1 running Vesta, and then use Vesta file system to serve the video.} } @InCollection{kalns:video-book, author = {Edgar T. Kalns and Yarsun Hsu}, title = {Video on Demand Using the {Vesta} Parallel File System}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {8}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {187--204}, publisher = {Kluwer Academic Publishers}, earlier = {kalns:video}, keywords = {parallel I/O, parallel file system, video on demand, multimedia, pario-bib}, abstract = {Video on Demand (VoD) servers are expected to serve hundreds of customers with as many, or more, movie videos. Such an environment requires large storage capacity and real-time, high-bandwidth transmission capabilities. Massive striping of videos across disk arrays is a viable means to store large amounts of video data and, through parallelism of file access, achieve the needed bandwidth. The Vesta Parallel File System facilitates parallel access from an application to files distributed across a set of I/O processors, each with a set of attached disks. Given Vesta's parallel file access capabilities, this paper examines a number of issues pertaining to the implementation of VoD services on top of Vesta. We develop a prototype VoD experimentation environment on an IBM SP-1 and analyze Vesta's performance in video data retrieval for real-time playback. Specifically, we explore the impact of concurrent video streams competing for I/O node resources, cache effects, and video striping across multiple I/O nodes.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{kamalvanshi:pdfs, author = {Ajay Kamalvanshi and S. K. Ghoshal and R. C. Hansdah}, title = {Design, implementation, and performance evaluation of a parallel distributed file system}, booktitle = {Proceedings of the 1995 International Conference on High Performance Computing}, year = {1995}, month = {December}, pages = {125--129}, address = {New Delhi, India}, keywords = {parallel file system, parallel I/O, pario-bib} } @Article{kandaswamy:evaluation, author = {Meenakshi A. Kandaswamy and Mahmut Kandemir and Alok Choudhary and David Bernholdt}, title = {An Experimental Evaluation of {I/O} Optimizations on Different Applications}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2002}, month = {July}, volume = {13}, number = {7}, pages = {728--744}, publisher = {IEEE Computer Society Press}, URL = {http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=21941}, keywords = {parallel I/O, parallel application, pario-bib}, abstract = {Many large-scale applications have significant I/O requirements as well as computational and memory requirements. Unfortunately, the limited number of I/O nodes provided in a typical configuration of the modern message-passing distributed-memory architectures such as the Intel Paragon and the IBM SP-2 limits the I/O performance of these applications severely.
In this paper, we examine some software optimization techniques and evaluate their effects in five different I/O-intensive codes from both small and large application domains. Our goals in this study are twofold. First, we want to understand the behavior of large-scale data-intensive applications and the impact of I/O subsystems on their performance and vice versa. Second, and more importantly, we strive to determine the solutions for improving the applications' performance by a mix of software techniques. Our results reveal that different applications can benefit from different optimizations. For example, we found that some applications benefit from file layout optimizations, whereas others take advantage of collective I/O. A combination of architectural and software solutions is normally needed to obtain good I/O performance. For example, we show that with a limited number of I/O resources, it is possible to obtain good performance by using appropriate software optimizations. We also show that beyond a certain level, imbalance in the architecture results in performance degradation even when using optimized software, thereby indicating the necessity of an increase in I/O resources.} } @InProceedings{kandaswamy:hartree, author = {Meenakshi A. Kandaswamy and Mahmut T. Kandemir and Alok N. Choudhary and David E. Bernholdt}, title = {Optimization and Evaluation of {Hartree-Fock} Application's {I/O} with {PASSION}}, booktitle = {Proceedings of SC97: High Performance Networking and Computing}, year = {1997}, month = {November}, publisher = {ACM Press}, address = {San Jose}, later = {kandaswamy:hartree-fock}, URL = {http://doi.acm.org/10.1145/509593.509624}, keywords = {parallel I/O, scientific computing, pario-bib}, abstract = {Parallel machines are an important part of the scientific application developer's tool box and the computational and processing demands placed on these machines are rapidly increasing. Many scientific applications tend to perform high volume data storage, data retrieval and data processing, which demands high performance from the I/O subsystem. In this paper, we conduct an experimental study of the I/O performed by the Hartree-Fock (HF) method, as implemented using a fully distributed data approach in the NWChem parallel computational chemistry package. We use PASSION, a parallel and scalable I/O library and its optimizations such as prefetching to improve the I/O performance of the HF application and we also present extensive experimental results of the same. The effects of both application-related factors and system-related factors on the application's I/O performance are studied in detail. We rank the optimizations based on the significance and impact on the performance of HF's I/O phase as: I. efficient interface to the file system, II. prefetching optimization, and III. buffering. The results show that within the limits of our experimental parameters, application-related factors are more effective on the overall I/O behavior of this application.
We obtained up to 95\% improvement in I/O time and 43\% improvement in the overall application performance with these optimizations.}, comment = {No page numbers: proceedings on CDROM and web only.} } @Article{kandaswamy:hartree-fock, author = {Meenakshi Kandaswamy and Mahmut Kandemir and Alok Choudhary and David Bernholdt}, title = {An Experimental Study to Analyze and Optimize Hartree-Fock Application's {I/O} with {PASSION}}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Winter}, volume = {12}, number = {4}, pages = {411--439}, note = {In a Special Issue on I/O in Parallel Applications}, earlier = {kandaswamy:hartree}, keywords = {parallel I/O application, pario-bib}, abstract = {Many scientific applications tend to perform high volume data storage, data retrieval and data processing, which demands high performance from the I/O subsystem. The focus and contribution of this work is to study the I/O behavior of the Hartree-Fock method using PASSION. HF's I/O phases can contribute up to 62.34\% of the total execution time. We reduce the execution time and I/O time up to 54\% and 6\% respectively of that of the original case through PASSION and its optimizations. Additionally, we categorize the factors that affect the I/O performance of HF into key application-related parameters and key system-related parameters. Based on extensive empirical results and within our experimental space, we order the parameters according to their impact on HF's I/O performance as follows: efficient interface, prefetching, buffering, number of I/O nodes, striping factor and striping unit. We conclude that application-related factors have a more significant effect on HF's I/O performance than the system-related factors within our experimental space.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @Article{kandemir:compiler, author = {Mahmut Kandemir}, title = {Compiler-Directed Collective {I/O}}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2001}, month = {December}, volume = {12}, number = {12}, pages = {1318--1331}, URL = {http://computer.org/tpds/td2001/l1318abs.htm}, keywords = {parallel I/O, collective I/O, compiler, pario-bib}, abstract = {Current approaches to parallel I/O demand extensive user effort to obtain acceptable performance. This is in part due to difficulties in understanding the characteristics of a wide variety of I/O devices and in part due to inherent complexity of I/O software. While parallel I/O systems provide users with environments where persistent data sets can be shared between parallel processors, the ultimate performance of I/O-intensive codes depends largely on the relation between data access patterns exhibited by parallel processors and storage patterns of data in files and on disks. In cases where access patterns and storage patterns match, we can exploit parallel I/O hardware by allowing each processor to perform independent parallel I/O. In order to keep performance decent under circumstances in which data access patterns and storage patterns do not match, several I/O optimization techniques have been developed in recent years. Collective I/O is such an optimization technique that enables each processor to do I/O on behalf of other processors if doing so improves the overall performance. 
While it is generally accepted that collective I/O and its variants can bring impressive improvements as far as the I/O performance is concerned, it is difficult for the programmer to use collective I/O in an optimal manner. In this paper, we propose and evaluate a compiler-directed collective I/O approach which detects the opportunities for collective I/O and inserts the necessary I/O calls in the code automatically. An important characteristic of the approach is that instead of applying collective I/O indiscriminately, it uses collective I/O selectively only in cases where independent parallel I/O would not be possible or would lead to an excessive number of I/O calls. The approach involves compiler-directed access pattern and storage pattern detection schemes that work on a multiple application environment. We have implemented the necessary algorithms in a source-to-source translator and within a stand-alone tool. Our experimental results on an SGI/Cray Origin 2000 multiprocessor machine demonstrate that our compiler-directed collective I/O scheme performs very well on different setups built using nine applications from several scientific benchmarks. We have also observed that the I/O performance of our approach is only 5.23 percent worse than an optimal scheme.} } @InProceedings{kandemir:io-optimize, author = {Mahmut Kandemir and Alok Choudhary and Rajesh Bordawekar}, title = {{I/O} Optimizations for Compiling Out-of-Core Programs on Distributed-Memory Machines}, booktitle = {Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing}, year = {1997}, month = {March}, pages = {8--9}, publisher = {Society for Industrial and Applied Mathematics}, note = {To appear. Extended Abstract.}, URL = {http://www.cacr.caltech.edu/~rajesh/siam97.ps}, keywords = {parallel I/O, compiler, out-of-core, pario-bib}, abstract = {Since many large-scale computational problems deal with large quantities of data, optimizing the performance of I/O subsystems of massively parallel machines is an important challenge for system designers. We describe data access reorganization strategies for efficient compilation of out-of-core data-parallel programs on distributed memory machines. Our analytical approach and experimental results indicate that the optimizations introduced in this paper can reduce the amount of time spent in I/O by as much as an order of magnitude on both uniprocessors and multicomputers.} } @InProceedings{kandemir:locality, author = {M. Kandemir and A. Choudhary and J. Ramanujam and M. Kandaswamy}, title = {A Unified Compiler Algorithm for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {79--92}, publisher = {ACM Press}, address = {San Jose, CA}, URL = {http://doi.acm.org/10.1145/266220.266228}, keywords = {compiler, out of core, parallel I/O, pario-bib}, abstract = {This paper presents compiler algorithms to optimize out-of-core programs. These algorithms consider loop and data layout transformations in a unified framework. The performance of an out-of-core loop nest containing many references can be improved by a combination of restructuring the loops and file layouts. This approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality.
When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost tiling loop. Preliminary results from hand-compiles on IBM SP-2 and Intel Paragon show that this approach reduces the execution time, improves the bandwidth speedup and overall speedup. In addition, we extend the base algorithm to work with file layout constraints and show how it can be used for optimizing programs consisting of multiple loop nests.} } @Article{kandemir:ooc, author = {M. Kandemir and A. Choudhary and J. Ramanujam and R. Bordawekar}, title = {Compilation techniques for out-of-core parallel computations}, journal = {Parallel Computing}, year = {1998}, month = {May}, volume = {24}, number = {3}, pages = {597--628}, keywords = {parallel I/O, compiler, out of core, pario-bib} } @Article{kandemir:optimizations, author = {M. T. Kandemir}, title = {Compiler-directed Optimizations for Improving the Performance of {I/O}-Intensive Applications}, journal = {International Journal of Parallel and Distributed Systems and Networks}, year = {2002}, volume = {5}, number = {2}, pages = {52--65}, publisher = {Acta Press}, URL = {http://www.actapress.com/journals/toc/toc2042002.htm#2002vol5issue2}, keywords = {parallel I/O, compiler, pario-bib} } @InProceedings{kandemir:optimize, author = {Mahmut Kandemir and Alok Choudhary and J. Ramanujam and Rajesh Bordawekar}, title = {Optimizing Out-of-Core Computations in Uniprocessors}, booktitle = {Proceedings of the Workshop on Interaction between Compilers and Computer Architectures}, year = {1997}, month = {February}, pages = {1--10}, publisher = {Kluwer Academic Publishers}, URL = {http://www.cacr.caltech.edu/~rajesh/hpca.ps}, keywords = {parallel I/O, compiler, out-of-core, pario-bib}, abstract = {Programs accessing disk-resident arrays perform poorly in general due to excessive number of I/O calls and insufficient help from compilers. In this paper, in order to alleviate this problem, we propose a series of compiler optimizations. Both the analytical approach we use and the experimental results provide strong evidence that our method is very effective on uniprocessors for out-of-core nests whose data sizes far exceed the size of available memory.} } @Article{kandemir:optimizing, author = {M. Kandemir and A. Choudhary and J. Ramanujam and R. Bordawekar}, title = {Optimizing Out-of-core Computations in Uniprocessors}, journal = {Newsletter of the Technical Committee on Computer Architecture (TCCA)}, year = {1997}, month = {June}, pages = {25--27}, keywords = {out of core, parallel I/O, pario-bib} } @InProceedings{kandemir:reorganize, author = {Mahmut Kandemir and Rajesh Bordawekar and Alok Choudhary}, title = {Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines}, booktitle = {Proceedings of the Eleventh International Parallel Processing Symposium}, year = {1997}, month = {April}, pages = {559--564}, URL = {http://www.cacr.caltech.edu/~rajesh/ipps97.ps}, keywords = {compiler, data-parallel, out-of-core, parallel I/O, pario-bib}, abstract = {This paper describes optimization techniques for translating out-of-core programs written in a data parallel language to message passing node programs with explicit parallel I/O. We demonstrate that straightforward extension of in-core compilation techniques does not work well for out-of-core programs. 
We then describe how the compiler can optimize the code by (1) determining appropriate file layouts for out-of-core arrays, (2) permuting the loops in the nest(s) to allow efficient file access, and (3) partitioning the available node memory among references based on I/O cost estimation. Our experimental results indicate that these optimizations can reduce the amount of time spent in I/O by as much as an order of magnitude.} } @InProceedings{kandemir:tiling, author = {Mahmut Kandemir and Rajesh Bordawekar and Alok Choudhary and J. Ramanujam}, title = {A Unified Tiling Approach for Out-of-Core Computations}, booktitle = {Sixth Workshop on Compilers for Parallel Computers}, year = {1996}, month = {December}, pages = {323--334}, publisher = {Forschungzentrum Julich GmbH}, address = {Aachen, Germany}, note = {Also available as Caltech Technical Report CACR 130}, URL = {http://www.cacr.caltech.edu/~rajesh/cpc.ps}, keywords = {parallel I/O, compiler, out-of-core, pario-bib}, abstract = {This paper describes a framework by which an out-of-core stencil program written in a data-parallel language can be translated into node programs in a distributed-memory message-passing machine with explicit I/O and communication. We focus on a technique called \emph{Data Space Tiling} to group data elements into slabs that can fit into memories of processors. Methods to choose \emph{legal} tile shapes under several constraints and deadlock-free scheduling of tiles are investigated. Our approach is \emph{unified} in the sense that it can be applied to both FORALL loops and the loops that involve flow-dependences.} } @Article{karges:par-pipe, author = {Jonathan Karges and Otto Ritter and S\'andor Suhai}, title = {Design and Implementation of a Parallel Pipe}, journal = {ACM Operating Systems Review}, year = {1997}, month = {April}, volume = {31}, number = {2}, pages = {60--94}, keywords = {interprocess communication, parallel I/O, pario-bib}, comment = {A parallel version of the Unix 'pipe' feature, for connecting the output of one program to multiple other programs or files. Implemented on Solaris. Performance results.} } @TechReport{karpovich:bottleneck, author = {John F. Karpovich and Andrew S. Grimshaw and James C. French}, title = {Breaking the {I/O} Bottleneck at the {National Radio Astronomy Observatory (NRAO)}}, year = {1993}, month = {August}, number = {CS-94-37}, institution = {University of Virginia}, URL = {ftp://ftp.cs.virginia.edu/pub/techreports/CS-94-37.ps.Z}, keywords = {scientific database, parallel I/O, pario-bib}, comment = {See also karpovich:case-study. That is a subset of this paper.} } @InProceedings{karpovich:case-study, author = {John F. Karpovich and James C. French and Andrew S. Grimshaw}, title = {High Performance Access to Radio Astronomy Data: A Case Study}, booktitle = {Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management}, year = {1994}, month = {September}, note = {Also available as Univ. of Virginia TR CS-94-25}, URL = {ftp://ftp.cs.virginia.edu/pub/techreports/CS-94-25.ps.Z}, keywords = {scientific database, parallel I/O, pario-bib}, comment = {Apparently a subset of karpovich:bottleneck. They store a sparse, multidimensional data set (radio astronomy data) as a set of tagged data values, ie, as a set of tuples, each with several keys and a data value. They use a PLOP format to partition each dimension into slices, so that each intersection of the slices forms a bucket. 
They decide on the splits based on a preliminary statistical survey of the data. Bucket overflow is handled by chaining. Then, they evaluate various kinds of queries, ie, multidimensional range queries, for their performance. In this workload queries (reads) are much more common than updates (writes).} } @InProceedings{karpovich:elfs, author = {John F. Karpovich and Andrew S. Grimshaw and James C. French}, title = {Extensible File Systems {(ELFS)}: An Object-Oriented Approach to High Performance File {I/O}}, booktitle = {Proceedings of the Ninth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications}, year = {1994}, month = {October}, pages = {191--204}, publisher = {ACM Press}, address = {Portland, OR}, URL = {ftp://ftp.cs.virginia.edu/pub/techreports/CS-94-28.ps.Z}, keywords = {parallel I/O, multiprocessor file system interface, object oriented, pario-bib}, comment = {See also grimshaw:elfs, grimshaw:ELFSTR, grimshaw:objects, and karpovich:*. This is also available as UVA TR CS-94-28. This paper focuses more on the object-oriented nature of ELFS than on its ability to support parallel I/O. It also describes two classes they've developed, one for 2d dense matrices and another for range queries on multidimensional sparse data sets. It does have some new performance numbers for ELFS on Intel CFS.} } @InCollection{katz:bdiskarch, author = {Randy H. Katz and Garth A. Gibson and David A. Patterson}, title = {Disk System Architectures for High Performance Computing}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {2}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {15--34}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {katz:diskarch}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, RAID, disk array, disk striping, pario-bib}, comment = {Part of jin:io-book; reformatted version of katz:diskarch.} } @Article{katz:diskarch, author = {Randy H. Katz and Garth A. Gibson and David A. Patterson}, title = {Disk System Architectures for High Performance Computing}, journal = {Proceedings of the IEEE}, year = {1989}, month = {December}, volume = {77}, number = {12}, pages = {1842--1858}, later = {katz:bdiskarch}, keywords = {parallel I/O, RAID, disk array, disk striping, pario-bib}, comment = {Good review of the background of disks and I/O architectures, but a shorter RAID presentation than patterson:RAID. Also addresses controller structure. Good ref for the I/O crisis background, though they don't use that term here. Good taxonomy of previous array techniques.} } @Article{katz:io-subsys, author = {Randy H. Katz and John K. Ousterhout and David A. Patterson and Michael R. Stonebraker}, title = {A Project on High Performance {I/O} Subsystems}, journal = {{IEEE} Database Engineering Bulletin}, year = {1988}, month = {March}, volume = {11}, number = {1}, pages = {40--47}, keywords = {parallel I/O, RAID, Sprite, reliability, disk striping, disk array, pario-bib}, comment = {Early RAID project paper. Describes the Berkeley team's plan to use an array of small (100M) hard disks as an I/O server for network file service, transaction processing, and supercomputer I/O. Considering performance, reliability, and flexibility. Initially hooked to their SPUR multiprocessor, using Sprite operating system, new filesystem. Either asynchronous striped or independent operation.
Supercomputer I/O is characterized as sequential, minimum latency, low throughput. Use of parity disks to boost reliability. Files may be striped across one or more disks and extend over several sectors, thus a two-dimensional filesystem; striping need not involve all disks.} } @InProceedings{katz:netfs, author = {Randy H. Katz}, title = {Network-Attached Storage Systems}, booktitle = {Scalable High Performance Computing Conference}, year = {1992}, pages = {68--75}, keywords = {distributed file system, supercomputer file system, file striping, RAID, parallel I/O, pario-bib}, comment = {Comments on the emerging trend of file systems for mainframes and supercomputers that are not attached directly to the computer, but instead to a network attached to the computer. Avoiding data copying seems to be a critical issue in the OS and controllers, for disk and network interfaces. Describes RAID-II prototype.} } @Article{katz:update, author = {Randy H. Katz and John K. Ousterhout and David A. Patterson and Peter Chen and Ann Chervenak and Rich Drewes and Garth Gibson and Ed Lee and Ken Lutz and Ethan Miller and Mendel Rosenblum}, title = {A Project on High Performance {I/O} Subsystems}, journal = {Computer Architecture News}, year = {1989}, month = {September}, volume = {17}, number = {5}, pages = {24--31}, keywords = {parallel I/O, RAID, reliability, disk array, pario-bib}, comment = {A short summary of the RAID project. Some more up-to-date info, like that they have completed the first prototype with 8 SCSI strings and 32 disks.} } @InProceedings{keane:commercial, author = {J. A. Keane and T. N. Franklin and A. J. Grant and R. Sumner and M. Q. Xu}, title = {Commercial Users' Requirements for Parallel Systems}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {15--25}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, keywords = {parallel architecture, parallel I/O, databases, commercial requirements, pario-bib}, abstract = {This paper reports on part of an on-going analysis of parallel systems for commercial users. The particular focus of this paper is on the requirements that commercial users, in particular users with financial database systems, have of parallel systems. \par The issues of concern to such users differ from those of concern to science and engineering users. Performance of the parallel system is not the only, or indeed primary, reason for moving to such systems for commercial users. Infra-structure issues are important, such as system availability and inter-working with existing systems. \par These issues are discussed in the context of a banking customer's requirements. The various technical concerns that these requirements impose are discussed in terms of commercially available systems.} } @TechReport{kennedy:sio, author = {Ken Kennedy and Charles Koelbel and Mike Paleczny}, title = {Scalable {I/O} for Out-of-Core Structures}, year = {1993}, month = {November}, number = {CRPC-TR93357-S}, institution = {Center for Research on Parallel Computation, Rice University}, note = {Updated August, 1994}, keywords = {parallel I/O, out-of-core, pario-bib}, comment = {They describe a project they are beginning, which attempts to have the compiler analyze a program that uses large arrays, and insert explicit I/O statements to move data from those arrays to and from disk. 
This is seen as an alternative to OS and hardware virtual memory, and is likely to provide much better performance (or so their initial results show). Their focus is on overlapping I/O and computation.} } @Article{kermarrec:ha-psls, author = {Anne-Marie Kermarrec and Christine Morin}, title = {{HA-PSLS}: a highly available parallel single-level store system.}, journal = {Concurrency and Computation: Practice and Experience}, booktitle = {European Conference on Parallel Computing; August 2001; UNIV MANCHESTER; MANCHESTER, ENGLAND}, year = {2003}, month = {August}, volume = {15}, number = {10}, pages = {911--937}, institution = {Inst Natl Rech Informat \& Automat, IRISA, Campus Univ Beaulieu, F-35042 Rennes, France; Inst Natl Rech Informat \& Automat, IRISA, F-35042 Rennes, France; Microsoft Res, Cambridge CB3 0FB, England}, publisher = {UK : Wiley, 2003}, copyright = {(c)2004 IEE; Institute for Scientific Information, Inc.}, URL = {http://www.irisa.fr/paris/Biblio/Papers/Kermarrec/KerMor02ccpe.pdf}, keywords = {parallel single level store, high-availability, fault tolerance, checkpointing, replication, integration, parallel file systems, shared virtual memory, pario-bib}, abstract = {Parallel single-level store (PSLS) systems integrate a shared virtual memory and a parallel file system. They provide programmers with a global address space including both memory and file data. PSLS systems implemented in a cluster thus represent a natural support for long-running parallel applications, combining both the natural shared memory programming model and a large and efficient file system. However, the need to tolerate failures in such a system increases with the size of applications. We present a highly-available parallel single level store system (HA-PSLS), which smoothly integrates a backward error recovery high-availability mechanism into a PSLS system. Our system is able to tolerate multiple transient failures, a single permanent failure, and power cut failures affecting the whole cluster, without requiring any specialized hardware. For this purpose, HA-PSLS relies on a high degree of integration (and reusability) of high-availability and standard features. A prototype integrating our high-availability support has been implemented and we show some performance results.} } @TechReport{khanna:group, author = {Sanjay Khanna and David Kotz}, title = {A Split-Phase Interface for Parallel File Systems}, year = {1997}, month = {March}, number = {PCS-TR97-312}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/151/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/khanna:group.pdf}, keywords = {multiprocessor file system interface, run-time library, parallel file system, parallel I/O, pario-bib, dfk}, abstract = {We describe the effects of a new user-level library for the Galley Parallel File System. This library allows some pre-existing sequential programs to make use of the Galley Parallel File System with minimal modification. It permits programs to efficiently use the parallel file system because the user-level library groups accesses together. We examine the performance of our library, and we show how code needs to be modified to use the library.} } @Article{kim:asynch, author = {Michelle Y. Kim and Asser N.
Tantawi}, title = {Asynchronous Disk Interleaving: {Approximating} Access Delays}, journal = {IEEE Transactions on Computers}, year = {1991}, month = {July}, volume = {40}, number = {7}, pages = {801--810}, publisher = {IEEE Computer Society Press}, keywords = {disk interleaving, parallel I/O, performance modeling, pario-bib}, comment = {As opposed to synchronous disk interleaving, where disks are rotationally synchronous and one access is processed at a time. They develop a performance model and validate it with traces of a database system's disk accesses. Average access delay on each disk can be approximated by a normal distribution.} } @Article{kim:fft, author = {Michelle Y. Kim and Anil Nigam and George Paul and Robert H. Flynn}, title = {Disk Interleaving and Very Large Fast {F}ourier Transforms}, journal = {International Journal of Supercomputer Applications}, year = {1987}, volume = {1}, number = {3}, pages = {75--96}, keywords = {parallel I/O, disk striping, scientific computing, algorithm, pario-bib} } @PhdThesis{kim:interleave, author = {Michelle Y. Kim}, title = {Synchronously Interleaved Disk Systems with their Application to the Very Large {FFT}}, year = {1986}, school = {IBM Thomas J. Watson Research Center}, address = {Yorktown Heights, New York 10598}, note = {IBM Report number RC12372}, earlier = {kim:interleaving}, keywords = {parallel I/O, disk striping, file access pattern, disk array, pario-bib}, comment = {Uniprocessor interleaving techniques. Good case for interleaving. Probably better to reference kim:interleaving and kim:fft. Discusses a 3D FFT algorithm in which the matrix is broken into subblocks that are accessed in layers. The layers are stored so that access is either contiguous or has a regular stride, in fairly large chunks.} } @Article{kim:interleaving, author = {Michelle Y. Kim}, title = {Synchronized Disk Interleaving}, journal = {IEEE Transactions on Computers}, year = {1986}, month = {November}, volume = {C-35}, number = {11}, pages = {978--988}, publisher = {IEEE Computer Society Press}, later = {kim:interleave}, keywords = {parallel I/O, disk striping, disk array, pario-bib} } @InProceedings{kimbrel:prefetch, author = {Tracy Kimbrel and Pei Cao and Edward Felten and Anna Karlin and Kai Li}, title = {Integrating Parallel Prefetching and Caching}, booktitle = {Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1996}, month = {May}, pages = {262--263}, publisher = {ACM Press}, address = {Philadelphia, PA}, note = {Poster paper.}, keywords = {disk prefetching, parallel I/O, pario-bib}, comment = {They do a theoretical analysis of prefetching and caching in uniprocessor, single- and multi-disk situations, given that they know the complete access sequence; their measure is not hit rate but rather overall execution time. They found some algorithms that are close to optimal.} } @InProceedings{kimbrel:prefetch-trace, author = {Tracy Kimbrel and Andrew Tomkins and R. Hugo Patterson and Brian Bershad and Pei Cao and Edward Felten and Garth Gibson and Anna R.
Karlin and Kai Li}, title = {A Trace-Driven Comparison of Algorithms for Parallel Prefetching and Caching}, booktitle = {Proceedings of the 1996 Symposium on Operating Systems Design and Implementation}, year = {1996}, month = {October}, pages = {19--34}, publisher = {USENIX Association}, URL = {http://www.usenix.org/publications/library/proceedings/osdi96/kimbrel.html}, keywords = {parallel I/O, tracing, prefetch, trace-driven simulation, pario-bib}, abstract = {High-performance I/O systems depend on prefetching and caching in order to deliver good performance to applications. These two techniques have generally been considered in isolation, even though there are significant interactions between them; a block prefetched too early reduces the effectiveness of the cache, while a block cached too long reduces the effectiveness of prefetching. In this paper we study the effects of several combined prefetching and caching strategies for systems with multiple disks. Using disk-accurate trace-driven simulation, we explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints. Some of the strategies have been published with theoretical performance bounds, and some are components of systems that have been built. One is a new algorithm that combines the desirable characteristics of the others. We find that when performance is limited by I/O stalls, aggressive prefetching helps to alleviate the problem; that more conservative prefetching is appropriate when significant I/O stalls are not present; and that a single, simple strategy is capable of doing both.} } @InProceedings{kitsuregawa:sdc, author = {Masaru Kitsuregawa and Satoshi Hirano and Masanobu Harada and Minoru Nakamura and Mikio Takagi}, title = {The {Super Database Computer (SDC)}: System Architecture, Algorithm and Preliminary Evaluation}, booktitle = {Proceedings of the Twenty-Fifth Annual Hawaii International Conference on System Sciences}, year = {1992}, volume = {I}, pages = {308--319}, keywords = {parallel database, parallel I/O, pario-bib}, comment = {Most interesting to me in this paper is their discussion of the ``container model,'' in which they claim they allow the processors to be driven by the I/O devices. See hirano:deadlock.} } @InProceedings{klaskey:data-streaming, author = {Scott Alan Klasky and Stephane Ethier and Zhihong Lin and Kevin Martins and Doug McCune and Ravi Samtaney}, title = {Grid-Based Parallel Data Streaming implemented for the Gyrokinetic Toroidal Code}, booktitle = {Proceedings of SC2003: High Performance Networking and Computing}, year = {2003}, month = {November}, publisher = {IEEE Computer Society Press}, address = {Phoenix, AZ}, URL = {http://www.sc-conference.org/sc2003/paperpdfs/pap207.pdf}, keywords = {grid, parallel data streams, hydrodynamics, application, parallel I/O, pario-app, pario-bib}, abstract = {We have developed a threaded parallel data streaming approach using Globus to transfer multi-terabyte simulation data from a remote supercomputer to the scientist's home analysis/visualization cluster, as the simulation executes, with negligible overhead. Data transfer experiments show that this concurrent data transfer approach is more favorable compared with writing to local disk and then transferring this data to be post-processed. The present approach is conducive to using the grid to pipeline the simulation with post-processing and visualization. 
We have applied this method to the Gyrokinetic Toroidal Code (GTC), a 3-dimensional particle-in-cell code used to study micro-turbulence in magnetic confinement fusion from first principles plasma theory.}, comment = {published on the web} } @InProceedings{klimkowski:solver, author = {Ken Klimkowski and Robert {van de Geijn}}, title = {Anatomy of an Out-of-core Dense Linear Solver}, booktitle = {Proceedings of the 1995 International Conference on Parallel Processing}, year = {1995}, pages = {III:29--33}, publisher = {CRC Press}, address = {St. Charles, IL}, URL = {http://www.cs.utexas.edu/users/rvdg/papers/pipsolver.ps}, keywords = {out-of-core algorithm, parallel I/O, pario-bib}, abstract = {In this paper, we describe the design and implementation of the Platform Independent Parallel Solver (PIPSolver) package for the out-of-core (OOC) solution of complex dense linear systems. Our approach is unique in that it allows essentially all of RAM to be filled with the current portion of the matrix (slab) to be updated and factored, thereby greatly improving the computation to I/O ratio over previous approaches. This work could be viewed in part as an exercise in maximal code reuse: By formulating the OOC LU factorization just right, we managed to reuse essentially all of a very robust and efficient incore solver, leading directly to a very robust and efficient OOC solver. Experiences and performance are reported for the Cray T3D system.} } @InProceedings{kobler:eosdis, author = {Ben Kobler and John Berbert and Parris Caulk and P.~C. Hariharan}, title = {Architecture and Design of Storage and Data Management for the {NASA Earth Observing System Data and Information System (EOSDIS)}}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {65--76}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/conferen/mss95/kobler/kobler.htm}, keywords = {mass storage, I/O architecture, parallel I/O, pario-bib}, abstract = {Mission to Planet Earth (MTPE) is a long-term NASA research mission to study the processes leading to global climate change. The EOS Data and Information System (EOSDIS) is the component within MTPE that will provide the Earth science community with easy, affordable, and reliable access to Earth science data. EOSDIS is a distributed system, with major facilities at eight Distributed Active Archive Centers (DAACs) located throughout the United States. At the DAACs the Science Data Processing Segment (SDPS) will receive, process, archive, and manage all data. It is estimated that several hundred gigaflops of processing power will be required to process and archive the several terabytes of new data that will be generated and distributed daily. Thousands of science users and perhaps several hundred thousand nonscience users will access the system.} } @TechReport{kotz:app-pario, author = {David Kotz}, title = {Applications of Parallel {I/O}}, year = {1996}, month = {October}, number = {PCS-TR96-297}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {David Kotz}, note = {Release 1}, later = {oldfield:app-pario}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/139/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:app-pario.pdf}, keywords = {parallel I/O application, file access patterns, dfk, pario-bib}, abstract = {Scientific applications are increasingly being implemented on massively parallel supercomputers. 
Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O. It will be updated periodically.} } @InCollection{kotz:bdiskdir, author = {David Kotz}, title = {Disk-directed {I/O} for {MIMD} Multiprocessors}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {35}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {513--535}, publisher = {IEEE Computer Society Press and John Wiley \& Sons}, copyright = {ACM}, identical = {kotz:jdiskdir}, keywords = {parallel I/O, multiprocessor file system, file system caching, dfk, pario-bib}, abstract = {Many scientific applications that run on today's multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and enhanced file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, to allow the disk servers to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible both for simple reads and writes and for an out-of-core application. Indeed, our disk-directed I/O technique provided consistent high performance that was largely independent of data distribution, obtained up to 93\% of peak disk bandwidth, and was as much as 18 times faster than the traditional technique.}, comment = {In jin:io-book, reprinted from kotz:jdiskdir.} } @InCollection{kotz:bpractical, author = {David Kotz and Carla Schlatter Ellis}, title = {Practical Prefetching Techniques for Multiprocessor File Systems}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {17}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {245--258}, publisher = {IEEE Computer Society Press and John Wiley \& Sons}, copyright = {Kluwer Academic Publishers}, address = {New York, NY}, identical = {kotz:jpractical}, keywords = {dfk, parallel file system, prefetching, disk caching, parallel I/O, MIMD, pario-bib}, abstract = {Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies that base decisions only on on-line reference history, and that can be implemented efficiently. We also test the ability of these policies across a range of architectural parameters.}, comment = {Reformatted version of kotz:jpractical. 
In jin:io-book.} } @InProceedings{kotz:diskdir, author = {David Kotz}, title = {Disk-directed {I/O} for {MIMD} Multiprocessors}, booktitle = {Proceedings of the 1994 Symposium on Operating Systems Design and Implementation}, year = {1994}, month = {November}, pages = {61--74}, publisher = {USENIX Association}, copyright = {David Kotz}, note = {Updated as Dartmouth TR PCS-TR94-226 on November 8, 1994}, later = {kotz:diskdir-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:diskdir.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:diskdir.pdf}, keywords = {parallel I/O, multiprocessor file system, file system caching, pario-bib, dfk}, abstract = {Many scientific applications that run on today's multiprocessors are bottlenecked by their file I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and improved file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, {\em disk-directed I/O}, that flips the usual relationship between server and client to allow the disks (actually, disk servers) to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, and close to the maximum disk bandwidth.}, comment = {This paper also appeared in Bulletin of the IEEE Technical Committee on Operating Systems and Application Environments, Autumn 1994, pp. 29--42. Also available at http://www.usenix.org/publications/library/proceedings/osdi/kotz.html. \par SEE TECH REPORT kotz:diskdir-tr. Please note that the tech report contains newer numbers than those in the OSDI version, although the conclusions have not changed. \par Slides of OSDI presentation (Postscript, 988 Kbytes): one per page, or two per page.} } @TechReport{kotz:diskdir-tr, author = {David Kotz}, title = {Disk-directed {I/O} for {MIMD} Multiprocessors}, year = {1994}, month = {July}, number = {PCS-TR94-226}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {David Kotz}, note = {Revised November 8, 1994}, earlier = {kotz:diskdir}, later = {kotz:jdiskdir}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/97/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:diskdir-tr.pdf}, keywords = {parallel I/O, multiprocessor file system, file system caching, dfk, pario-bib}, abstract = {Many scientific applications that run on today's multiprocessors are bottlenecked by their file I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and improved file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, {\em disk-directed I/O}, that flips the usual relationship between server and client to allow the disks (actually, disk servers) to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, and close to the maximum disk bandwidth.}, comment = {Short version appeared in OSDI'94. 
Please note that the revised tech report contains newer numbers than those in the OSDI version, although the conclusions have not changed.} } @Article{kotz:diskdir2, author = {David Kotz}, title = {Disk-directed {I/O} for {MIMD} Multiprocessors}, journal = {Bulletin of the IEEE Technical Committee on Operating Systems and Application Environments}, year = {1994}, month = {Autumn}, pages = {29--42}, publisher = {IEEE Computer Society Press}, copyright = {David Kotz}, later = {kotz:diskdir-tr}, keywords = {parallel I/O, multiprocessor file system, file system caching, pario-bib, dfk}, abstract = {Many scientific applications that run on today's multiprocessors are bottlenecked by their file I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and improved file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, {\em disk-directed I/O}, that flips the usual relationship between server and client to allow the disks (actually, disk servers) to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, and close to the maximum disk bandwidth.}, comment = {Same as kotz:diskdir. \par SEE TECH REPORT kotz:diskdir-tr. Please note that the tech report contains newer numbers than those in the OSDI version, although the conclusions have not changed.} } @InCollection{kotz:encyc1, author = {David Kotz and Ravi Jain}, title = {{I/O} in Parallel and Distributed Systems}, booktitle = {Encyclopedia of Computer Science and Technology}, editor = {Allen Kent and James G. Williams}, year = {1999}, volume = {40}, pages = {141--154}, publisher = {Marcel Dekker, Inc.}, copyright = {Marcel Dekker, Inc.}, note = {Supplement 25}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:encyc1.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:encyc1.pdf}, keywords = {survey, parallel I/O, pario-bib, dfk}, abstract = {We sketch the reasons for the I/O bottleneck in parallel and distributed systems, pointing out that it can be viewed as a special case of a general bottleneck that arises at all levels of the memory hierarchy. We argue that because of its severity, the I/O bottleneck deserves systematic attention at all levels of system design. We then present a survey of the issues raised by the I/O bottleneck in six key areas of parallel and distributed systems: applications, algorithms, languages and compilers, run-time libraries, operating systems, and architecture.} } @InCollection{kotz:encyc2, author = {David Kotz}, title = {Parallel Input/Output}, booktitle = {Encyclopedia of Distributed Computing}, editor = {Joseph Urban and Partha Dasgupta}, year = {2002}, publisher = {Kluwer Academic Publishers}, copyright = {the author}, note = {Accepted for publication}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:encyc2.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:encyc2.pdf}, keywords = {survey, parallel I/O, pario-bib, dfk}, comment = {A very short (2300 words) overview of Parallel I/O. See longer article kotz:encyc1. See better introductory material in iopads-book. 
It's not clear whether this encyclopedia will ever be published.} } @InProceedings{kotz:expand, author = {David Kotz}, title = {Expanding the Potential for Disk-Directed {I/O}}, booktitle = {Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing}, year = {1995}, month = {October}, pages = {490--495}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, address = {San Antonio, TX}, earlier = {kotz:expand-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:expand.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:expand.pdf}, keywords = {parallel I/O, multiprocessor file systems, dfk, pario-bib}, abstract = {As parallel computers are increasingly used to run scientific applications with large data sets, and as processor speeds continue to increase, it becomes more important to provide fast, effective parallel file systems for data storage and for temporary files. In an earlier work we demonstrated that a technique we call disk-directed I/O has the potential to provide consistent high performance for large, collective, structured I/O requests. In this paper we expand on this potential by demonstrating the ability of a disk-directed I/O system to read irregular subsets of data from a file, and to filter and distribute incoming data according to data-dependent functions.} } @TechReport{kotz:expand-tr, author = {David Kotz}, title = {Expanding the Potential for Disk-Directed {I/O}}, year = {1995}, month = {March}, number = {PCS-TR95-254}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {David Kotz}, later = {kotz:expand}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/115/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:expand-tr.pdf}, keywords = {parallel I/O, multiprocessor file systems, dfk, pario-bib}, abstract = {As parallel computers are increasingly used to run scientific applications with large data sets, and as processor speeds continue to increase, it becomes more important to provide fast, effective parallel file systems for data storage and for temporary files. In an earlier work we demonstrated that a technique we call disk-directed I/O has the potential to provide consistent high performance for large, collective, structured I/O requests. In this paper we expand on this potential by demonstrating the ability of a disk-directed I/O system to read irregular subsets of data from a file, and to filter and distribute incoming data according to data-dependent functions.} } @InProceedings{kotz:explore, author = {David Kotz and Ting Cai}, title = {Exploring the use of {I/O} Nodes for Computation in a {MIMD} Multiprocessor}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {78--89}, copyright = {the authors}, earlier = {kotz:explore-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:explore.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:explore.pdf}, keywords = {parallel I/O, multiprocessor file system, dfk, pario-bib}, abstract = {As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost-effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between ``compute'' and ``I/O'' nodes, the latter having attached disks and being dedicated to running the file-system server. 
This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. \par Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high-performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of {\em both\/} compute and I/O nodes for computation, and that operating systems should make this possible.} } @TechReport{kotz:explore-tr, author = {David Kotz and Ting Cai}, title = {Exploring the use of {I/O} Nodes for Computation in a {MIMD} Multiprocessor}, year = {1994}, month = {October}, number = {PCS-TR94-232}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, note = {Revised 2/20/95}, later = {kotz:explore}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/104/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:explore-tr.pdf}, keywords = {parallel I/O, multiprocessor file system, dfk, pario-bib}, abstract = {Most MIMD multiprocessors today are configured with two distinct types of processor nodes: those that have disks attached, which are dedicated to file I/O, and those that do not have disks attached, which are used for running applications. Several architectural trends have led some to propose configuring systems so that all processors are used for application processing, even those with disks attached. We examine this idea experimentally, focusing on the impact of remote I/O requests on local computational processes. We found that in an efficient file system the I/O processors can transfer data at near peak speeds with little CPU overhead, leaving substantial CPU power for running applications. On the other hand, we found that some complex file-system features could require substantial CPU overhead. Thus, for a multiprocessor system to obtain good I/O and computational performance on a mix of applications, the file system (both operating system and libraries) must be prepared to adapt their policies to changing conditions.} } @Article{kotz:flexibility, author = {David Kotz and Nils Nieuwejaar}, title = {Flexibility and Performance of Parallel File Systems}, journal = {ACM Operating Systems Review}, year = {1996}, month = {April}, volume = {30}, number = {2}, pages = {63--73}, publisher = {ACM Press}, copyright = {the authors}, later = {kotz:flexibility2}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:flexibility.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:flexibility.pdf}, keywords = {parallel I/O, multiprocessor file system, dfk, pario-bib}, abstract = {Many scientific applications for high-performance multiprocessors have tremendous I/O requirements. As a result, the I/O system is often the limiting factor of application performance. 
Several new parallel file systems have been developed in recent years, each promising better performance for some class of parallel applications. As we gain experience with parallel computing, and parallel file systems in particular, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make application portability a significant problem. \par We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (APIs). We think of this approach as the ``RISC'' of parallel file-system design. \par We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.}, comment = {A position paper.} } @InProceedings{kotz:flexibility2, author = {David Kotz and Nils Nieuwejaar}, title = {Flexibility and Performance of Parallel File Systems}, booktitle = {Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC)}, year = {1996}, month = {September}, series = {Lecture Notes in Computer Science}, volume = {1127}, pages = {1--11}, publisher = {Springer-Verlag}, copyright = {Springer-Verlag}, earlier = {kotz:flexibility}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:flexibility2.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:flexibility2.pdf}, keywords = {parallel I/O, multiprocessor file system, dfk, pario-bib}, abstract = {As we gain experience with parallel file systems, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk-management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make applications difficult to port. \par We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (APIs). \par We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.}, comment = {Nearly identical to kotz:flexibility. The only changes are the format, a shorter abstract, and updates to Section 7 and the references.} } @TechReport{kotz:fsint, author = {David Kotz}, title = {Multiprocessor File System Interfaces}, year = {1992}, month = {March}, number = {PCS-TR92-179}, institution = {Dept.
of Math and Computer Science, Dartmouth College}, copyright = {David Kotz}, note = {Revised version appeared in PDIS'93}, later = {kotz:fsint2}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/74/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:fsint.pdf}, keywords = {dfk, parallel I/O, multiprocessor file system, file system interface, pario-bib}, abstract = {Increasingly, file systems for multiprocessors are designed with parallel access to multiple disks, to keep I/O from becoming a serious bottleneck for parallel applications. Although file system software can transparently provide high-performance access to parallel disks, a new file system interface is needed to facilitate parallel access to a file from a parallel application. We describe the difficulties faced when using the conventional (Unix-like) interface in parallel applications, and then outline ways to extend the conventional interface to provide convenient access to the file for parallel programs, while retaining the traditional interface for programs that have no need for explicitly parallel file access. Our interface includes a single naming scheme, a {\em multiopen\/} operation, local and global file pointers, mapped file pointers, logical records, {\em multifiles}, and logical coercion for backward compatibility.}, comment = {See also lake:pario for implementation of some of the ideas.} } @InProceedings{kotz:fsint2, author = {David Kotz}, title = {Multiprocessor File System Interfaces}, booktitle = {Proceedings of the Second International Conference on Parallel and Distributed Information Systems}, year = {1993}, pages = {194--201}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {kotz:fsint}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:fsint2.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:fsint2.pdf}, keywords = {dfk, parallel I/O, multiprocessor file system, file system interface, pario-bib}, abstract = {Increasingly, file systems for multiprocessors are designed with parallel access to multiple disks, to keep I/O from becoming a serious bottleneck for parallel applications. Although file system software can transparently provide high-performance access to parallel disks, a new file system interface is needed to facilitate parallel access to a file from a parallel application. We describe the difficulties faced when using the conventional (Unix-like) interface in parallel applications, and then outline ways to extend the conventional interface to provide convenient access to the file for parallel programs, while retaining the traditional interface for programs that have no need for explicitly parallel file access. 
Our interface includes a single naming scheme, a {\em multiopen\/} operation, local and global file pointers, mapped file pointers, logical records, {\em multifiles}, and logical coercion for backward compatibility.}, comment = {See also lake:pario for implementation of some of the ideas.} } @InProceedings{kotz:fsint2p, author = {David Kotz}, title = {Multiprocessor File System Interfaces}, booktitle = {Proceedings of the USENIX File Systems Workshop}, year = {1992}, month = {May}, pages = {149--150}, publisher = {USENIX Association}, copyright = {David Kotz}, later = {kotz:fsint2}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:fsint2p.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:fsint2p.pdf}, keywords = {dfk, parallel I/O, multiprocessor file system, file system interface, pario-bib}, comment = {Short paper (2 pages).} } @TechReport{kotz:int-ddio, author = {David Kotz}, title = {Interfaces for Disk-Directed {I/O}}, year = {1995}, month = {September}, number = {PCS-TR95-270}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {David Kotz}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/122/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:int-ddio.pdf}, keywords = {disk-directed I/O, parallel I/O, multiprocessor filesystem interfaces, pario-bib, dfk}, abstract = {In other papers I propose the idea of disk-directed I/O for multiprocessor file systems. Those papers focus on the performance advantages and capabilities of disk-directed I/O, but say little about the application-programmer's interface or about the interface between the compute processors and I/O processors. In this short note I discuss the requirements for these interfaces, and look at many existing interfaces for parallel file systems. I conclude that many of the existing interfaces could be adapted for use in a disk-directed I/O system.}, comment = {See also kotz:jdiskdir, kotz:expand, and kotz:lu.} } @Misc{kotz:iobib, author = {David Kotz}, title = {Bibliography about {Parallel I/O}}, year = {1994--2000}, howpublished = {Available on the WWW at {\tt http://www.cs.dartmouth.edu/pario/bib/}}, URL = {http://www.cs.dartmouth.edu/pario/bib/}, keywords = {parallel I/O, multiprocessor file system, dfk, pario-bib}, comment = {A bibliography of many references on parallel I/O and multiprocessor file-systems issues. As of the fifth edition, it is available on the WWW in HTML format.} } @Article{kotz:jdiskdir, author = {David Kotz}, title = {Disk-directed {I/O} for {MIMD} Multiprocessors}, journal = {ACM Transactions on Computer Systems}, year = {1997}, month = {February}, volume = {15}, number = {1}, pages = {41--74}, publisher = {ACM Press}, copyright = {ACM}, identical = {kotz:bdiskdir}, earlier = {kotz:diskdir-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:jdiskdir.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:jdiskdir.pdf}, keywords = {parallel I/O, multiprocessor file system, file system caching, dfk, pario-bib}, abstract = {Many scientific applications that run on today's multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and enhanced file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. 
We propose a new technique, disk-directed I/O, to allow the disk servers to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible both for simple reads and writes and for an out-of-core application. Indeed, our disk-directed I/O technique provided consistent high performance that was largely independent of data distribution, obtained up to 93\% of peak disk bandwidth, and was as much as 18 times faster than the traditional technique.}, comment = {This paper is a substantial revision of the diskdir-tr version: all of the experiments have been re-done, using a better-tuned version of the file systems (see kotz:tuning), and adding two-phase I/O to all comparisons. It also incorporates some of the material from kotz:expand and kotz:int-ddio. Also available at http://www.acm.org/pubs/citations/journals/tocs/1997-15-1/p41-kotz/.} } @Article{kotz:jpractical, author = {David Kotz and Carla Schlatter Ellis}, title = {Practical Prefetching Techniques for Multiprocessor File Systems}, journal = {Journal of Distributed and Parallel Databases}, year = {1993}, month = {January}, volume = {1}, number = {1}, pages = {33--51}, publisher = {Kluwer Academic Publishers}, copyright = {Kluwer Academic Publishers}, identical = {kotz:bpractical}, earlier = {kotz:practical}, keywords = {dfk, parallel file system, prefetching, disk caching, parallel I/O, MIMD, pario-bib}, abstract = {Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies that base decisions only on on-line reference history, and that can be implemented efficiently. We also test the ability of these policies across a range of architectural parameters.}, comment = {See also kotz:jwriteback, kotz:fsint2, cormen:integrate.} } @Article{kotz:jworkload, author = {David Kotz and Nils Nieuwejaar}, title = {File-System Workload on a Scientific Multiprocessor}, journal = {IEEE Parallel and Distributed Technology}, year = {1995}, month = {Spring}, volume = {3}, number = {1}, pages = {51--60}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {kotz:workload}, later = {nieuwejaar:workload-tr}, URL = {http://computer.org/concurrency/pd1995/p1051abs.htm}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:jworkload.pdf}, keywords = {parallel file system, file access pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk} } @Article{kotz:jwriteback, author = {David Kotz and Carla Schlatter Ellis}, title = {Caching and Writeback Policies in Parallel File Systems}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January and February}, volume = {17}, number = {1--2}, pages = {140--145}, publisher = {Academic Press}, copyright = {Academic Press}, earlier = {kotz:writeback}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:jwriteback.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:jwriteback.pdf}, keywords = {dfk, parallel file system, disk caching, parallel I/O, MIMD, pario-bib}, abstract = {Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. 
Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. Such parallel disk systems require parallel file system software to avoid performance-limiting bottlenecks. We discuss cache management techniques that can be used in a parallel file system implementation for multiprocessors with scientific workloads. We examine several writeback policies, and give results of experiments that test their performance.}, comment = {See kotz:jpractical, kotz:fsint2, cormen:integrate.} } @InProceedings{kotz:lu, author = {David Kotz}, title = {Disk-directed {I/O} for an Out-of-core Computation}, booktitle = {Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing}, year = {1995}, month = {August}, pages = {159--166}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {kotz:lu-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:lu.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:lu.pdf}, keywords = {parallel I/O, numerical analysis, dfk, pario-bib}, abstract = {New file systems are critical to obtain good I/O performance on large multiprocessors. Several researchers have suggested the use of {\em collective\/} file-system operations, in which all processes in an application cooperate in each I/O request. Others have suggested that the traditional low-level interface ({\tt read, write, seek}) be augmented with various higher-level requests (e.g., {\em read matrix}). Collective, high-level requests permit a technique called {\em disk-directed I/O\/} to significantly improve performance over traditional file systems and interfaces, at least on simple I/O benchmarks. In this paper, we present the results of experiments with an ``out-of-core'' LU-decomposition program. Although its collective interface was awkward in some places, and forced additional synchronization, disk-directed I/O was able to obtain much better overall performance than the traditional system.} } @TechReport{kotz:lu-tr, author = {David Kotz}, title = {Disk-directed {I/O} for an Out-of-core Computation}, year = {1995}, month = {January}, number = {PCS-TR95-251}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {David Kotz}, later = {kotz:lu}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/112/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:lu-tr.pdf}, keywords = {parallel I/O, numerical analysis, dfk, pario-bib}, abstract = {New file systems are critical to obtain good I/O performance on large multiprocessors. Several researchers have suggested the use of {\em collective\/} file-system operations, in which all processes in an application cooperate in each I/O request. Others have suggested that the traditional low-level interface ({\tt read, write, seek}) be augmented with various higher-level requests (e.g., {\em read matrix}), allowing the programmer to express a complex transfer in a single (perhaps collective) request. Collective, high-level requests permit techniques like {\em two-phase I/O\/} and {\em disk-directed I/O\/} to significantly improve performance over traditional file systems and interfaces. Neither of these techniques have been tested on anything other than simple benchmarks that read or write matrices. Many applications, however, intersperse computation and I/O to work with data sets that cannot fit in main memory. 
In this paper, we present the results of experiments with an ``out-of-core'' LU-decomposition program, comparing a traditional interface and file system with a system that has a high-level, collective interface and disk-directed I/O. We found that a collective interface was awkward in some places, and forced additional synchronization. Nonetheless, disk-directed I/O was able to obtain much better performance than the traditional system.} } @InCollection{kotz:pioarch, author = {David Kotz}, title = {Introduction to Multiprocessor {I/O} Architecture}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {4}, editor = {Ravi Jain and John Werth and James C. Browne}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {97--123}, publisher = {Kluwer Academic Publishers}, copyright = {Kluwer Academic Publishers}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:pioarch.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:pioarch.pdf}, keywords = {parallel I/O, multiprocessor file system, pario-bib, dfk}, abstract = {The computational performance of multiprocessors continues to improve by leaps and bounds, fueled in part by rapid improvements in processor and interconnection technology. I/O performance thus becomes ever more critical, to avoid becoming the bottleneck of system performance. In this paper we provide an introduction to I/O architectural issues in multiprocessors, with a focus on disk subsystems. While we discuss examples from actual architectures and provide pointers to interesting research in the literature, we do not attempt to provide a comprehensive survey. We concentrate on a study of the architectural design issues, and the effects of different design alternatives.}, comment = {Invited paper. Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{kotz:practical, author = {David Kotz and Carla Schlatter Ellis}, title = {Practical Prefetching Techniques for Parallel File Systems}, booktitle = {Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year = {1991}, month = {December}, pages = {182--189}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {kotz:thesis}, later = {kotz:jpractical}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:practical.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:practical.pdf}, keywords = {dfk, parallel file system, prefetching, disk caching, parallel I/O, MIMD, OS93W extra, OS92W, pario-bib}, abstract = {Parallel disk subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies, and show that prefetching can be implemented efficiently even for the more complex parallel file access patterns. We test these policies across a range of architectural parameters.}, comment = {Short form of primary thesis results. See kotz:jwriteback, kotz:fsint2, cormen:integrate.} } @Article{kotz:prefetch, author = {David F. 
Kotz and Carla Schlatter Ellis}, title = {Prefetching in File Systems for {MIMD} Multiprocessors}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1990}, month = {April}, volume = {1}, number = {2}, pages = {218--230}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {ellis:prefetch}, later = {kotz:thesis}, keywords = {dfk, parallel file system, prefetching, MIMD, disk caching, parallel I/O, pario-bib}, abstract = {The problem of providing file I/O to parallel programs has been largely neglected in the development of multiprocessor systems. There are two essential elements of any file system design intended for a highly parallel environment: parallel I/O and effective caching schemes. This paper concentrates on the second aspect of file system design and specifically, on the question of whether prefetching blocks of the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions. \par Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that 1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, 2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O operation, and 3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). \par We explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in this environment.} } @PhdThesis{kotz:thesis, author = {David Kotz}, title = {Prefetching and Caching Techniques in File Systems for {MIMD} Multiprocessors}, year = {1991}, month = {April}, school = {Duke University}, copyright = {David Kotz}, note = {Available as technical report CS-1991-016}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:thesis.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:thesis.pdf}, keywords = {dfk, parallel file system, prefetching, MIMD, disk caching, parallel I/O, pario-bib}, abstract = {The increasing speed of the most powerful computers, especially multiprocessors, makes it difficult to provide sufficient I/O bandwidth to keep them running at full speed for the largest problems. Trends show that the difference in the speed of disk hardware and the speed of processors is increasing, with I/O severely limiting the performance of otherwise fast machines. This widening access-time gap is known as the ``I/O bottleneck crisis.'' One solution to the crisis, suggested by many researchers, is to use many disks in parallel to increase the overall bandwidth. \par This dissertation studies some of the file system issues needed to get high performance from parallel disk systems, since parallel hardware alone cannot guarantee good performance. The target systems are large MIMD multiprocessors used for scientific applications, with large files spread over multiple disks attached in parallel. The focus is on automatic caching and prefetching techniques.
We show that caching and prefetching can transparently provide the power of parallel disk hardware to both sequential and parallel applications using a conventional file system interface. We also propose a new file system interface (compatible with the conventional interface) that could make it easier to use parallel disks effectively. \par Our methodology is a mixture of implementation and simulation, using a software testbed that we built to run on a BBN GP1000 multiprocessor. The testbed simulates the disks and fully implements the caching and prefetching policies. Using a synthetic workload as input, we use the testbed in an extensive set of experiments. The results show that prefetching and caching improved the performance of parallel file systems, often dramatically.}, comment = {Published as kotz:prefetch, kotz:jwriteback, kotz:jpractical, kotz:fsint2.} } @TechReport{kotz:throughput, author = {David Kotz}, title = {Throughput of Existing Multiprocessor File Systems}, year = {1993}, month = {May}, number = {PCS-TR93-190}, institution = {Dept. of Math and Computer Science, Dartmouth College}, copyright = {David Kotz}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/81/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:throughput.pdf}, keywords = {parallel I/O, multiprocessor file system, performance, survey, dfk, pario-bib}, comment = {A brief note on the reported performance of existing file systems (Intel CFS, nCUBE, CM-2, CM-5, and Cray). Many have disappointingly low absolute throughput, in MB/s.} } @TechReport{kotz:tuning, author = {David Kotz}, title = {Tuning {STARFISH}}, year = {1996}, month = {October}, number = {PCS-TR96-296}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {David Kotz}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/138/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:tuning.pdf}, keywords = {parallel I/O, multiprocessor file system, dfk, pario-bib}, abstract = {STARFISH is a parallel file-system simulator we built for our research into the concept of disk-directed I/O. In this report, we detail steps taken to tune the file systems supported by STARFISH, which include a traditional parallel file system (with caching) and a disk-directed I/O system. In particular, we now support two-phase I/O, use smarter disk scheduling, increased the maximum number of outstanding requests that a compute processor may make to each disk, and added gather/scatter block transfer. We also present results of the experiments driving the tuning effort.}, comment = {Reports on some new changes to the STARFISH simulator that implements traditional caching and disk-directed I/O. This is meant mainly as a companion to kotz:jdiskdir. 
See also kotz:jdiskdir, kotz:diskdir, kotz:expand.} } @InProceedings{kotz:workload, author = {David Kotz and Nils Nieuwejaar}, title = {Dynamic File-Access Characteristics of a Production Parallel Scientific Workload}, booktitle = {Proceedings of Supercomputing '94}, year = {1994}, month = {November}, pages = {640--649}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, address = {Washington, DC}, earlier = {kotz:workload-tr}, later = {kotz:jworkload}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:workload.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:workload.pdf}, keywords = {parallel file system, file access pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk}, abstract = {Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data {\em in parallel\/} to hundreds or thousands of processors. \par Most successful systems are based on a solid understanding of the characteristics of the expected workload, but until now there have been no comprehensive workload characterizations of multiprocessor file systems. We began the CHARISMA project in an attempt to fill that gap. We instrumented the common node library on the iPSC/860 at NASA Ames to record all file-related activity over a two-week period. Our instrumentation is different from previous efforts in that it collects information about every read and write request and about the {\em mix\/} of jobs running in the machine (rather than from selected applications). \par The trace analysis in this paper leads to many recommendations for designers of multiprocessor file systems. First, the file system should support simultaneous access to many different files by many jobs. Second, it should expect to see many small requests, predominantly sequential and regular access patterns (although of a different form than in uniprocessors), little or no concurrent file-sharing between jobs, significant byte- and block-sharing between processes within jobs, and strong interprocess locality. Third, our trace-driven simulations showed that these characteristics led to great success in caching, both at the compute nodes and at the I/O~nodes. Finally, we recommend supporting strided I/O requests in the file-system interface, to reduce overhead and allow more performance optimization by the file system.}, comment = {Also at http://www.acm.org/pubs/citations/proceedings/supercomputing/198354/p640-kotz and http://computer.org/conferen/sc94/kotz.html} } @TechReport{kotz:workload-tr, author = {David Kotz and Nils Nieuwejaar}, title = {Dynamic File-Access Characteristics of a Production Parallel Scientific Workload}, year = {1994}, month = {April}, number = {PCS-TR94-211}, institution = {Dept. of Math and Computer Science, Dartmouth College}, copyright = {the authors}, note = {Revised May 11, 1994}, later = {kotz:workload}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/98/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:workload-tr.pdf}, keywords = {parallel file system, file access pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk}, abstract = {Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. 
An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data {\em in parallel\/} to hundreds or thousands of processors. \par Most successful systems are based on a solid understanding of the characteristics of the expected workload, but until now there have been no comprehensive workload characterizations of multiprocessor file systems. We began the CHARISMA project in an attempt to fill that gap. We instrumented the common node library on the iPSC/860 at NASA Ames to record all file-related activity over a two-week period. Our instrumentation is different from previous efforts in that it collects information about every read and write request and about the {\em mix\/} of jobs running in the machine (rather than from selected applications). \par The trace analysis in this paper leads to many recommendations for designers of multiprocessor file systems. First, the file system should support simultaneous access to many different files by many jobs. Second, it should expect to see many small requests, predominantly sequential and regular access patterns (although of a different form than in uniprocessors), little or no concurrent file-sharing between jobs, significant byte- and block-sharing between processes within jobs, and strong interprocess locality. Third, our trace-driven simulations showed that these characteristics led to great success in caching, both at the compute nodes and at the I/O~nodes. Finally, we recommend supporting strided I/O requests in the file-system interface, to reduce overhead and allow more performance optimization by the file system.} } @InProceedings{kotz:writeback, author = {David Kotz and Carla Schlatter Ellis}, title = {Caching and Writeback Policies in Parallel File Systems}, booktitle = {IEEE Symposium on Parallel and Distributed Processing}, year = {1991}, month = {December}, pages = {60--67}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {kotz:thesis}, later = {kotz:jwriteback}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:writeback.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/kotz:writeback.pdf}, keywords = {dfk, parallel file system, disk caching, parallel I/O, MIMD, pario-bib}, abstract = {Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. Such parallel disk systems require parallel file system software to avoid performance-limiting bottlenecks. We discuss cache management techniques that can be used in a parallel file system implementation. We examine several writeback policies, and give results of experiments that test their performance.}, comment = {See also kotz:jpractical, kotz:fsint2, cormen:integrate.} } @Article{krammer:marmot, author = {Bettina Krammer and Matthias S. M{\"u}ller and Michael M. 
Resch}, title = {{MPI I/O} analysis and error detection with {MARMOT}}, journal = {Lecture Notes in Computer Science}, booktitle = {Proceedings of the 11th European Parallel Virtual Machine and Message Passing Interface Users Group Meeting}, editor = {Kranzlmuller, D. and Kacsuk, P. and Dongarra, J.}, year = {2004}, month = {September}, volume = {3241}, pages = {242--250}, institution = {Ctr High Performance Comp, Allmandring 30, D-70550 Stuttgart, Germany; Ctr High Performance Comp, D-70550 Stuttgart, Germany}, publisher = {SPRINGER-VERLAG BERLIN}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, address = {Budapest, Hungary}, URL = {http://springerlink.metapress.com/link.asp?id=up8fqm0vlua6pjgl}, keywords = {MPI I/O, error detection, performance analysis, MARMOT, pario-bib}, abstract = {The most frequently used part of MPI-2 is MPI I/O. Due to the complexity of parallel programming in general, and of handling parallel I/O in particular, there is a need for tools that support the application development process. There are many situations where incorrect usage of MPI by the application programmer can be automatically detected. In this paper we describe the MARMOT tool that uncovers some of these errors and we also analyze to what extent it is possible to do so for MPI I/O.} } @Article{krieger:asf, author = {Orran Krieger and Michael Stumm and Ronald Unrau}, title = {The {Alloc Stream Facility}: A Redesign of Application-level Stream {I/O}}, journal = {IEEE Computer}, year = {1994}, month = {March}, volume = {27}, number = {3}, pages = {75--82}, publisher = {IEEE Computer Society Press}, earlier = {krieger:asf-tr}, keywords = {memory-mapped file, file system, parallel I/O, pario-bib} } @TechReport{krieger:asf-tr, author = {Orran Krieger and Michael Stumm and Ronald Unrau}, title = {The {Alloc Stream Facility}: A Redesign of Application-level Stream {I/O}}, year = {1992}, month = {October}, number = {CSRI-275}, institution = {Computer Systems Research Institute, University of Toronto}, address = {Toronto, Canada, M5S 1A1}, later = {krieger:asf}, keywords = {memory-mapped file, file system, parallel I/O, pario-bib}, abstract = {This paper describes the design and implementation of a new application level I/O facility, called the Alloc Stream Facility. The Alloc Stream Facility has several key advantages. First, performance is substantially improved as a result of a) the structure of the facility that allows it to take advantage of system specific features like mapped files, and b) a reduction in data copying and the number of I/O system calls. Second, the facility is designed for multi-threaded applications running on multiprocessors and allows for a high degree of concurrency. Finally, the facility can support a variety of I/O interfaces, including stdio, emulated Unix I/O, ASI, and C++ streams, in a way that allows applications to freely intermix calls to the different interfaces, resulting in improved code reusability. \par We show that on several Unix workstation platforms the performance of Unix applications using the Alloc Stream Facility can be substantially better than when the applications use the original I/O facilities.}, comment = {See also krieger:mapped. ``This is an extended version of the paper with the same title in the March, 1994 edition of IEEE Computer.'' A 3-level interface structure: interface, backplane, and stream-specific modules. Different interfaces available: unix, stdio, ASI (theirs), C++. Common backplane.
Stream-specific implementations that export operations like salloc and sfree, which return pointers to data buffers. ASI exports that interface to the user, for maximum efficiency. Performance is best when using mapped files as underlying implementation. Many stdio or unix apps are faster only after relinking. ASI is even faster. In addition to better performance, also get multithreading support, multiple interfaces, and extensibility.} } @InProceedings{krieger:hfs, author = {Orran Krieger and Michael Stumm}, title = {{HFS:} A Flexible File System for large-scale Multiprocessors}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {6--14}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, later = {krieger:hfs2}, URL = {ftp://ftp.cs.toronto.edu/pub/parallel/Krieger_Stumm_DAGS93.ps.Z}, keywords = {multiprocessor file system, parallel I/O, operating system, shared memory, pario-bib}, abstract = {The {H{\sc urricane}} File System (HFS) is a new file system being developed for large-scale shared memory multiprocessors with distributed disks. The main goal of this file system is scalability; that is, the file system is designed to handle demands that are expected to grow linearly with the number of processors in the system. To achieve this goal, HFS is designed using a new structuring technique called Hierarchical Clustering. HFS is also designed to be flexible in supporting a variety of policies for managing file data and for managing file system state. This flexibility is necessary to support in a scalable fashion the diverse workloads we expect for a multiprocessor file system.}, comment = {This paper is now out of date; see krieger:thesis. Designed for scalability on the hierarchical clustering model (see unrau:cluster), the Hurricane File System for NUMA shared-memory MIMD machines. Each cluster has its own full file system, which communicates with those in other clusters. Pieces are name server, open-file server, and block-file server. On first access, the file is mapped into the application space. VM system calls BFS to arrange transfers. Open questions: policies for file state management, block distribution, caching, and prefetching. Object-oriented approach used to allow for flexibility and extendability. Local disk file systems are log-structured.} } @InProceedings{krieger:hfs2, author = {Orran Krieger and Michael Stumm}, title = {{HFS}: A Performance-Oriented Flexible File System Based on Building-Block Compositions}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {95--108}, publisher = {ACM Press}, address = {Philadelphia}, earlier = {krieger:hfs}, later = {krieger:hfs3}, keywords = {parallel I/O, parallel file system, object-oriented, pario-bib}, abstract = {The Hurricane File System (HFS) is designed for (potentially large-scale) shared memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. 
For example, a file's structure can be optimized for concurrent random-access write-only operations by ten processes. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application's access pattern. In contrast, most existing parallel file systems support a single file structure and a small set of policies. \par We have implemented HFS as part of the Hurricane operating system running on the Hector shared memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.}, comment = {A published form of krieger:hfs and the thesis krieger:thesis. Their main point is that the file system is constructed from building-block objects. When you create a file you choose a few building blocks, for example, a replication block that mirrors the file, and some distribution blocks that distribute each replica across a set of disks. When you open the file you plug in some more building blocks, e.g., to do prefetching or to provide the kind of interface that you want to use. They point out that this flexibility is critical to be able to get good performance, because different file-access patterns need different structures and policies. They found that mapped files minimize copying costs and improve performance. They were able to obtain full disk bandwidth. Great paper.} } @Article{krieger:hfs3, author = {Orran Krieger and Michael Stumm}, title = {{HFS}: A Performance-Oriented Flexible File System Based on Building-Block Compositions}, journal = {ACM Transactions on Computer Systems}, year = {1997}, month = {August}, volume = {15}, number = {3}, pages = {286--321}, earlier = {krieger:hfs2}, URL = {http://www.acm.org/pubs/citations/journals/tocs/1997-15-3/p286-krieger/}, keywords = {parallel I/O, parallel file system, object-oriented, pario-bib}, abstract = {The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file's structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application's access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. 
We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.} } @PhdThesis{krieger:thesis, author = {Orran Krieger}, title = {{HFS}: A flexible file system for shared-memory multiprocessors}, year = {1994}, month = {October}, school = {University of Toronto}, URL = {ftp://ftp.cs.toronto.edu/pub/parallel/Okrieg_PhD.ps.Z}, keywords = {parallel I/O, multiprocessor file system, shared memory, memory-mapped I/O, pario-bib}, abstract = {The Hurricane File System (HFS) is designed for large-scale, shared-memory multiprocessors. Its architecture is based on the principle that a file system must support a wide variety of file structures, file system policies and I/O interfaces to maximize performance for a wide variety of applications. HFS uses a novel, object-oriented building-block approach to provide the flexibility needed to support this variety of file structures, policies, and I/O interfaces. File structures can be defined in HFS that optimize for sequential or random access, read-only, write-only or read/write access, sparse or dense data, large or small file sizes, and different degrees of application concurrency. Policies that can be defined on a per-file or per-open instance basis include locking policies, prefetching policies, compression/decompression policies and file cache management policies. In contrast, most existing file systems have been designed to support a single file structure and a small set of policies. \par We have implemented large portions of HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. Also, we show that HFS is able to deliver the full I/O bandwidth of the disks on our system to the applications.}, comment = {Excellent work. HFS uses an object-oriented building-block approach to provide flexible, scalable high performance. Indeed, HFS appears to be one of the most flexible parallel file systems available, allowing users to independently control (or redefine) policies for prefetching, caching, redundancy and fault tolerance, and declustering.} } @TechReport{krystynak:datavault, author = {John Krystynak}, title = {{I/O} Performance on the {Connection Machine DataVault} System}, year = {1992}, month = {May}, number = {RND-92-011}, institution = {NAS Systems Division, NASA Ames}, later = {krystynak:pario}, URL = {http://www.nas.nasa.gov/NAS/TechReports/RNDreports/RND-92-011/RND-92-011.html}, keywords = {parallel I/O, parallel file system, performance measurement, pario-bib}, comment = {Short measurements of CM-2 Datavault. Faster if you access through Paris. Can get nearly full 32 MB/s bandwidth.
Problem in its ability to use multiple CMIO busses.} } @InProceedings{krystynak:pario, author = {John Krystynak and Bill Nitzberg}, title = {Performance Characteristics of the {iPSC/860} and {CM-2} {I/O} Systems}, booktitle = {Proceedings of the Seventh International Parallel Processing Symposium}, year = {1993}, pages = {837--841}, publisher = {IEEE Computer Society Press}, address = {Newport Beach, CA}, earlier = {krystynak:datavault}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {Essentially a (short) combination of krystynak:datavault and nitzberg:cfs.} } @InProceedings{kucera:libc, author = {Julie Kucera}, title = {Making {\em libc}\/ Suitable for Use by Parallel Programs}, booktitle = {Proceedings of the USENIX Distributed and Multiprocessor Systems Workshop}, year = {1989}, pages = {145--152}, keywords = {parallel file system interface, pario-bib}, comment = {Experience making libc reentrant, adding semaphores, etc., on a Convex. Some problems with I/O. Added semaphores and private memory to make libc calls reentrant, i.e., callable in parallel by multiple threads.} } @MastersThesis{kumar:thesis, author = {Alok Kumar}, title = {{SysProView}: A Framework for Visualizing the Activities of Multiprocessor File Systems}, year = {1993}, school = {Thayer School of Engineering, Dartmouth College}, keywords = {parallel I/O, pario-bib}, comment = {A visualization tool, now long gone, for display of CHARISMA trace files. See nieuwejaar:workload for details of CHARISMA.} } @InProceedings{kuo:blackhole, author = {S. Kuo and M. Winslett and Y. Chen and Y. Cho and M. Subramaniam and K. Seamons}, title = {Application experience with parallel input/output: {Panda} and the {H3expresso} black hole simulation on the {SP2}}, booktitle = {Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing}, year = {1997}, keywords = {application experience, parallel input/output, parallel I/O, performance issues, multiprocessor file system interface, pario-bib}, abstract = {The paper summarizes our experiences using the Panda parallel I/O library with the H3expresso numerical relativity code on the Cornell SP2. Two performance issues are described: providing efficient off-loading of output data, and satisfying users' desire to dedicate fewer nodes to I/O. We explore the tradeoffs between potential solutions, and present performance results for our approaches. We also show that Panda's high level interface, which allows the user to request input or output of a set of arrays with a single command, is a good match for H3expresso's needs} } @InProceedings{kuo:efficient, author = {S. Kuo and M. Winslett and Y. Cho and J. Lee and Y. Chen}, title = {Efficient Input and Output for Scientific Simulations}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {33--44}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Kuo.ps}, keywords = {scientific computing, simulation, parallel I/O, pario-bib}, abstract = {Large simulations which run for hundreds of hours on parallel computers often periodically generate snapshots of states, which are later post-processed to visualize the simulated physical phenomenon. For many applications, fast I/O during post-processing, which is dependent on an efficient organization of data on disk, is as important as minimizing computation-time I/O. 
In this paper we propose optimizations to support efficient parallel I/O for scientific simulations and subsequent visualizations. We present an ordering mechanism to linearize data on disk, a performance model to help to choose a proper stripe unit size, and a scheduling algorithm to minimize communication contention. Our experiments on an IBM SP show that the combination of these strategies provides a 20-25\% performance boost.} } @InProceedings{kurc:query, author = {Tahsin Kurc and Chialin Chang and Renato Ferreira and Alan Sussman}, title = {Querying Very Large Multi-dimensional Datasets in {ADR}}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/sussman.pdf}, keywords = {scientific applications, query-based interface, parallel I/O, pario-bib}, comment = {They describe an architecture for accessing data in scientific datasets by performing range queries (a multidimensional bounding box) over the data. This type of access mechanism is useful for applications like satellite imaging.} } @InProceedings{kwan:cm5io, author = {Thomas T. Kwan and Daniel A. Reed}, title = {Performance of the {CM-5} Scalable File System}, booktitle = {Proceedings of the 8th ACM International Conference on Supercomputing}, year = {1994}, month = {July}, pages = {156--165}, publisher = {ACM Press}, address = {Manchester, UK}, keywords = {parallel I/O, parallel architecture, multiprocessor file system, pario-bib}, comment = {They measure the performance of the CM-5 Scalable File System using synthetic benchmarks. They compare CM-Fortran with CMMD. The hardware-dependent (``physical'') modes were much faster than the generic-format modes, which have to reorder data between the processor distribution and the disk distribution. The network turned out to be a bottleneck for the performance when reordering was needed. They conclude that more user control over the I/O would be very helpful.} } @PhdThesis{kwan:sort, author = {Sai Choi Kwan}, title = {External Sorting: {I/O} Analysis and Parallel Processing Techniques}, year = {1986}, month = {January}, school = {University of Washington}, note = {Available as technical report 86--01--01}, keywords = {parallel I/O, sorting, pario-bib}, comment = {Examines external sorting techniques such as merge sort, tag sort, multi-pass distribution sort, and one-pass distribution sort. The model is one where I/O complexity is included, assuming a linear seek time distribution and a cost of 1/2 rotation for each seek. Parallel I/O or computing are not considered until the distribution sorts. Architectural model on page 58.} } @InProceedings{kwong:distribution, author = {Peter Kwong and Shikharesh Majumdar}, title = {Study of Data Distribution Strategies for Parallel {I/O} Management}, booktitle = {Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC)}, year = {1996}, month = {September}, series = {Lecture Notes in Computer Science}, volume = {1127}, pages = {12--23}, publisher = {Springer-Verlag}, keywords = {parallel I/O, pario-bib}, abstract = {Recent studies have demonstrated that a significant number of I/O operations are performed by a number of classes of different parallel applications. Appropriate I/O management strategies are required however for harnessing the power of parallel I/O.
This paper focuses on two I/O management issues that affect system performance in multiprogrammed parallel environments. Characterization of I/O behavior of parallel applications in terms of four different models is discussed first, followed by an investigation of the performance of a number of different data distribution strategies. Using computer simulations this research shows that I/O characteristics of applications and data distribution have an important effect on system performance. Applications that can simultaneously do computation and I/O, plus strategies that can incorporate centralized I/O management are found to be beneficial for a multiprogrammed parallel environment.}, comment = {See majumdar:management.} } @Article{kwong:scheduling, author = {Peter Kwong and Shikharesh Majumdar}, title = {Scheduling of {I/O} in Multiprogrammed Parallel Systems}, journal = {Informatica}, year = {1999}, month = {April}, volume = {23}, number = {1}, pages = {67--76}, keywords = {parallel I/O, scheduling, pario-bib}, abstract = {Recent studies have demonstrated that significant I/O is performed by a number of parallel applications. In addition to running these applications on multiple processors, the parallelization of I/O operations and the use of multiple disk drives are required for achieving high system performance. This research is concerned with the effective management of parallel I/O by using appropriate I/O scheduling strategies. Based on a simulation model the performance of a number of scheduling policies are investigated. Using I/O characteristics of jobs such as the total outstanding I/O demand is observed to be useful in devising effective scheduling strategies.} } @InProceedings{lake:pario, author = {Brian Lake and Chris Gray}, title = {Parallel {I/O} for {MIMD} Machines}, booktitle = {Proceedings of SS'93: High Performance Computing}, year = {1993}, month = {June}, pages = {301--308}, address = {Calgary}, keywords = {parallel I/O, MIMD, multiprocessor file system, pario-bib}, comment = {They describe the I/O system for the Myrias SPS-3 parallel computer. The SPS is a no-remote-access (NORMA) machine with a software shared memory abstraction. They provide a standard C/FORTRAN I/O interface, with a few extensions. The user's parallel program is considered a client, and an I/O processor (IOP) is the server. No striping across IOPs, which makes it relatively simple for them to have the server manage the shared file pointer. Their extensions allow atomic, file-pointer update, returning the actual position where I/O occurred, and atomic access to fixed- and variable-length records. They have three protocols, for different transfer sizes; small using simple request/response; medium using sliding window; and large using scatter/gather and special hardware double buffering at the IOP. They use scatter/gather DMA, and page-table fiddling, for messaging. 
Performance is 89--96\% of hardware peak, limited by IOP's VME backplane.} } @Misc{large-scale-memories, key = {Algorithmica}, title = {Special issue on Large-Scale Memories}, year = {1994}, volume = {12}, number = {2}, howpublished = {Algorithmica}, keywords = {parallel I/O, algorithms, pario-bib} } @Article{latham:mpi-io-scalability, author = {Rob Latham and Rob Ross and Rajeev Thakur}, title = {The impact of file systems on {MPI-IO} scalability}, journal = {Lecture Notes in Computer Science}, booktitle = {11th European Parallel Virtual Machine and Message Passing Interface Users Group Meeting; September 19-22, 2004; Budapest, HUNGARY}, editor = {Kranzlmuller, D. and Kacsuk, P. and Dongarra, J.}, year = {2004}, month = {November}, volume = {3241}, pages = {87--96}, institution = {Argonne Natl Lab, 9700 S Cass Ave, Argonne, IL 60439 USA; Argonne Natl Lab, Argonne, IL 60439 USA}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://www.springerlink.com/link.asp?id=m31px2lt90296b62}, keywords = {scalability analysis, MPI-IO, pario-bib}, abstract = {As the number of nodes in cluster systems continues to grow, leveraging scalable algorithms in all aspects of such systems becomes key to maintaining performance. While scalable algorithms have been applied successfully in some areas of parallel I/O, many operations are still performed in an uncoordinated manner. In this work we consider, in three file system scenarios, the possibilities for applying scalable algorithms to the many operations that make up the MPI-IO interface. From this evaluation we extract a set of file system characteristics that aid in developing scalable MPI-IO implementations.} } @Article{latham:pvfs2, author = {Rob Latham and Neil Miller and Robert Ross and Phil Carns}, title = {A Next-Generation Parallel File System for Linux Clusters}, journal = {LinuxWorld}, year = {2004}, month = {January}, volume = {2}, number = {1}, keywords = {pvfs2, parallel file system, pario-bib} } @Article{latifi:network, author = {S. Latifi and M. Moraes de Azevedo and N. Bagherzadeh}, title = {A star-based {I/O}-bounded network for massively parallel systems}, journal = {IEE Proceedings--- Computers and Digital Techniques}, year = {1995}, month = {January}, volume = {142}, number = {1}, pages = {5--14}, keywords = {parallel I/O, parallel computer architecture, pario-bib}, abstract = {The paper describes a new interconnection network for massively parallel systems, referred to as star-connected cycles (SCC). The SCC graph presents an I/O-bounded structure that results in several advantages over variable degree graphs like the star and the hypercube. The description of the SCC graph includes issues such as labelling of nodes, degree, diameter and symmetry. The paper also presents an optimal routeing algorithm for the SCC and efficient broadcasting algorithms with O(n) running time, with n being the dimensionality of the graph. A comparison with the cube-connected cycles (CCC) and other interconnection networks is included, indicating that, for even n, an n-SCC and a CCC of similar sizes have about the same diameter.
In addition, it is shown that one-port broadcasting in an n-SCC graph can be accomplished with a running time better than or equal to that required by an n-star containing (n-1) times fewer nodes.} } @InProceedings{lauria:server, author = {Mario Lauria and Keith Bell and Andrew Chien}, title = {A High-Performance Cluster Storage Server}, booktitle = {Proceedings of the Eleventh IEEE International Symposium on High Performance Distributed Computing}, year = {2002}, pages = {311--320}, publisher = {IEEE Computer Society Press}, address = {Edinburgh, Scotland}, keywords = {srb, performance-related optimization, pario, pario-bib}, comment = {SRB data transfer optimization on cluster storage servers. If disk-bound, the system transfers from server to disks are broken so that protocol processing and disk transfer are pipelined. If network bound, stripe transfer from multiple clients to multiple servers. No mention of remote execution.} } @InProceedings{lautenbach:pfs, author = {Berin F. Lautenbach and Bradley M. Broom}, title = {A Parallel File System for the {AP1000}}, booktitle = {Proceedings of the Third Fujitsu-ANU CAP Workshop}, year = {1992}, month = {November}, keywords = {distributed file system, multiprocessor file system, pario-bib}, comment = {See also broom:acacia, broom:impl, mutisya:cache, and broom:cap. The Acacia file system has file access modes that are much like those in Intel CFS and TMC CMMD. By default all processes have their own file pointer, but they can switch to another mode either all together or in row- or column-subsets. The other modes include a replicated mode (where all read or write the same data), and a variety of shared modes, with arbitrary, fixed, or unspecified ordering among processors, and with fixed or variable-sized records. They also have a parallel-open operation, support for logical records, control over the striping width (number of disks) and height (block size), and control over redundancy. A prototype is running.} } @Article{lawlor:parity, author = {F.~D. Lawlor}, title = {Efficient mass storage parity recovery mechanism}, journal = {IBM Technical Disclosure Bulletin}, year = {1981}, month = {July}, volume = {24}, number = {2}, pages = {986--987}, keywords = {parallel I/O, disk array, RAID, pario-bib}, comment = {An early paper, perhaps the earliest, that describes the techniques that later became RAID. Lawlor notes how to use parity to recover data lost due to disk crash, as in RAID3, addresses the read-before-write problem by caching the old data block as well as the new data block, and shows how two-dimensional parity can protect against two or more failures.} } @InCollection{lee:bparity, author = {Edward K. Lee and Randy H. Katz}, title = {The Performance of Parity Placements in Disk Arrays}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {3}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {35--54}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {lee:jparity}, URL = {http://www.buyya.com/superstorage/}, keywords = {RAID, disk array, reliability, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of lee:jparity.} } @InCollection{lee:bpetal, author = {Edward K. Lee and Chandramohan A.
Thekkath}, title = {{Petal}: Distributed Virtual Disks}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {27}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {420--430}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {lee:petal}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, distributed file system, declustering, reliability, pario-bib}, comment = {Part of jin:io-book; reformatted version of lee:petal.} } @Article{lee:comparison, author = {K.~K. Lee and M. Kallahalla and B.~S. Lee and P.~J. Varman}, title = {Performance Comparison of Prefetching and Placement Policies for Parallel {I/O}}, journal = {International Journal of Parallel and Distributed Systems and Networks}, year = {2002}, volume = {5}, number = {2}, pages = {76--84}, publisher = {?}, URL = {http://www.actapress.com/journals/toc/toc2042002.htm#2002vol5issue2}, keywords = {parallel I/O, file prefetching, pario-bib} } @InProceedings{lee:external, author = {Jang Sun Lee and Sunghoon Ko and Sanjay Ranka and Byung Eui Min}, title = {High-Performance External Computations Using User-Controllable {I/O}}, booktitle = {Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing}, year = {1998}, month = {March}, pages = {303--307}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, pario-bib} } @Article{lee:file-assignment, author = {Lin-Wen Lee and Peter Scheuermann and Radek Vingralek}, title = {File Assignment in Parallel {I/O} Systems with Minimal Variance of Service Time}, journal = {IEEE Transactions on Computers}, year = {2000}, month = {February}, volume = {49}, number = {2}, pages = {127--140}, URL = {http://www.computer.org/tc/tc2000/t0127abs.htm}, keywords = {parallel I/O, parallel file system, pario-bib}, abstract = {We address the problem of assigning nonpartitioned files in a parallel I/O system where the file accesses exhibit Poisson arrival rates and fixed service times. We present two new file assignment algorithms based on open queuing networks which aim at minimizing simultaneously the load balance across all disks, as well as the variance of the service time at each disk. We first present an off-line algorithm, Sort Partition, which assigns to each disk files with similar access time. Next, we show that, assuming that a perfectly balanced file assignment can be found for a given set of files, Sort Partition will find the one with minimal mean response time. We then present an on-line algorithm, Hybrid Partition, that assigns groups of files with similar service times in successive intervals while guaranteeing that the load imbalance at any point does not exceed a certain threshold. We report on synthetic experiments which exhibit skew in file accesses and sizes and we compare the performance of our new algorithms with the vanilla greedy file allocation algorithm.} } @TechReport{lee:impl, author = {Edward K. Lee}, title = {Software and Performance Issues in the Implementation of a {RAID} Prototype}, year = {1990}, month = {May}, number = {UCB/CSD 90/573}, institution = {EECS, Univ. California at Berkeley}, URL = {http://cs-tr.cs.berkeley.edu/TR/UCB:CSD-90-573}, keywords = {parallel I/O, disk striping, performance, pario-bib}, comment = {Details of their prototype. Defines terms like stripe unit. Explores ways to lay out parity. Does performance simulations. 
Describes ops needed in device driver. Good to read if you plan to implement a RAID. Results: small R+W, or high loads, don't care about parity placement; in low load, there are different best cases for large R+W. Best all-around is left-symmetric. See also lee:parity.} } @Article{lee:jparity, author = {Edward K. Lee and Randy H. Katz}, title = {The Performance of Parity Placements in Disk Arrays}, journal = {IEEE Transactions on Computers}, year = {1993}, month = {June}, volume = {42}, number = {6}, pages = {651--664}, publisher = {IEEE Computer Society Press}, earlier = {lee:parity}, later = {lee:bparity}, keywords = {RAID, reliability, parallel I/O, disk striping, pario-bib}, comment = {Journal version of lee:parity.} } @InProceedings{lee:logical-disks, author = {Jang Sun Lee and Jungmin Kim and P. Bruce Berra and Sanjay Ranka}, title = {Logical Disks: User-Controllable {I/O} For Scientific Applications}, booktitle = {Proceedings of the 1996 IEEE Symposium on Parallel and Distributed Processing}, year = {1996}, month = {October}, pages = {340--347}, publisher = {IEEE Computer Society Press}, keywords = {logical disks, parallel I/O, pario-bib}, abstract = {In this paper we propose user-controllable I/O operations and explore the effects of them with some synthetic access patterns. The operations allow users to determine a file structure matching the access patterns, control the layout and distribution of data blocks on physical disks, and present various access patterns with a minimum number of I/O operations. The operations do not use a file pointer to access data as in typical file systems, which eliminates the overhead of managing the offset of the file, making it easy to share data and reducing the number of I/O operations.} } @InProceedings{lee:pario, author = {K-K. Lee and P. Varman}, title = {Prefetching and {I/O} Parallelism in Multiple Disk Systems}, booktitle = {Proceedings of the 1995 International Conference on Parallel Processing}, year = {1995}, month = {August}, pages = {III:160--163}, publisher = {CRC Press}, address = {St. Charles, IL}, keywords = {parallel I/O, prefetching, disk array, pario-bib} } @InProceedings{lee:parity, author = {Edward K. Lee and Randy H. Katz}, title = {Performance Consequences of Parity Placement in Disk Arrays}, booktitle = {Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems}, year = {1991}, pages = {190--199}, later = {lee:jparity}, keywords = {RAID, disk array, reliability, parallel I/O, pario-bib}, comment = {Interesting comparison of several parity placement schemes. Boils down to two basic choices, depending on whether read performance or write performance is more important to you.} } @InProceedings{lee:petal, author = {Edward K. Lee and Chandramohan A. Thekkath}, title = {Petal: Distributed Virtual Disks}, booktitle = {Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems}, year = {1996}, month = {October}, pages = {84--92}, address = {Cambridge, MA}, URL = {http://www.research.digital.com/SRC/personal/Chandu_Thekkath/Papers/petal-asplos96.ps}, keywords = {parallel I/O, distributed file system, declustering, reliability, pario-bib}, comment = {They are trying to build a file server that is easier to manage than most of today's distributed file systems, because disks are cheap but management is expensive. 
They describe a distributed file server that spreads blocks of all files across many disks and many servers. They use chained declustering so that they can survive loss of server or disk. They dynamically balance load. They dynamically reconfigure when new virtual disks are created or new physical disks are added. They've built it all and are now going to look at possible file systems that can take advantage of the features of Petal.} } @InProceedings{lee:raidmodel, author = {Edward K. Lee and Randy H. Katz}, title = {An Analytic Performance Model of Disk Arrays}, booktitle = {Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1993}, pages = {98--109}, keywords = {disk array, parallel I/O, RAID, analytic model, pario-bib} } @TechReport{lee:redist, author = {Jang Sun Lee and Sanjay Ranka and Ravi V. Shankar}, title = {Communication-Efficient and Memory-Bounded External Redistribution}, year = {1995}, institution = {Syracuse University}, keywords = {parallel I/O algorithm, out-of-core, pario-bib}, abstract = {This paper presents communication-efficient algorithms for the external data redistribution problem. Deterministic lower bounds and upper bounds are presented for the number of I/O operations, communication time and the memory requirements of external redistribution. Our algorithms differ from most other algorithms presented for out-of-core applications in that it is optimal (within a small constant factor) not only in the number of I/O operations, but also in the time taken for communication. A coarse-grained MIMD architecture with I/O subsystems attached to each processor is assumed, but the results are expected to be applicable over a wider variety of architectures.}, comment = {See shankar:transport for the underlying communication primitives.} } @InProceedings{lee:support, author = {Jenq Kuen Lee and Ing-Kuen Tsaur and San-Yih Huang}, title = {Language and Environmental Support for Parallel Object {I/O} on Distributed Memory Environments}, booktitle = {Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing}, year = {1995}, month = {February}, pages = {756--761}, publisher = {SIAM}, keywords = {parallel I/O, object oriented, distributed memory, pario-bib}, abstract = {The paper describes a parallel file object environment to support distributed array store on shared nothing distributed computing environments. Our environment enables programmers to extend the concept of array distribution from memory levels to file levels. It allows parallel I/O according to the distribution of objects in an application. When objects are read and/or written by multiple applications using different distributions, we present a novel scheme to help programmers to select the best data distribution pattern according to minimum amount of remote data movements for the store of array objects on distributed file systems.} } @InProceedings{lee:userio, author = {Jang Sun Lee and Sang-Gue Oh and Bruce P. Berra and Sanjay Ranka}, title = {User-Controllable {I/O} for Parallel Computers}, booktitle = {International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA~'96)}, year = {1996}, month = {August}, pages = {442--453}, keywords = {parallel I/O, pario-bib}, abstract = {This paper presents the design of UPIO, a software for user-controllable parallel input and output. UPIO is designed to maximize I/O performance for scientific applications on MIMD multicomputers. 
The most important features of UPIO are: It supports a domain-specific file model and a variety of application interfaces to present numerous access patterns. UPIO provides user-controllable I/O operations that allow users to control data access, file structure, and data distribution. The domain-specific file model and user controllability give low I/O overhead and allow programmers to exploit the aggregate bandwidth of parallel disks.}, comment = {They describe an interface that seems to allow easier access for programmers who want to map matrices onto parallel files. The concepts are not well explained, so it's hard to really understand what is new and different. They make no explicit comparison with other advanced interfaces like that in Vesta or Galley. No performance results.} } @TechReport{leon:dfs, author = {Christopher S. Leon}, title = {An Implementation of External-Memory Depth-First Search}, year = {1998}, month = {June}, number = {PCS-TR98-333}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {Christopher S. Leon}, URL = {https://digitalcommons.dartmouth.edu/senior_theses/186/}, keywords = {out-of-core algorithm, parallel I/O, pario-bib}, abstract = {In many different areas of computing, problems can arise which are too large to fit in main memory. For these problems, the I/O cost of moving data between main memory and secondary storage (for example, disks) becomes a significant bottleneck affecting the performance of the program. \par Since most algorithms do not take into account the size of main memory, new algorithms have been developed to optimize the number of I/O's performed. This paper details the implementation of one such algorithm, for external-memory depth-first search. \par Depth-first search is a basic tool for solving many problems in graph theory, and since graph theory is applicable for many large computational problems, it is important to make sure that such a basic tool is designed to avoid the bottleneck of main memory to secondary storage I/O's. \par The algorithm whose implementation is described in this paper is sketched out in an extended abstract by Chiang et al. We attempt to improve the given algorithm by minimizing I/O's performed, and to extend the algorithm by finding disjoint trees, and by classifying all the edges in the problem.}, comment = {Senior honors thesis. Advisor: Tom Cormen.} } @Article{lepper:cfd, author = {J. Lepper and U. Schnell and K.R.G. Hein}, title = {Parallelization of a simulation code for reactive flows on the {Intel Paragon}}, journal = {Computers and Mathematics with Applications}, year = {1998}, month = {April}, volume = {35}, number = {7}, pages = {101--109}, publisher = {Pergamon-Elsevier Science Ltd}, keywords = {parallel I/O, application, pario-bib}, abstract = {The paper shows the implementation of a 3D simulation code for turbulent flow and combustion processes in full-scale utility boilers on an Intel Paragon XP/S computer. For the portable parallelization, an explicit approach is chosen using a domain decomposition method for the static subdivision of the numerical grid together with the SPMD programming model. The measured speedup for the presented case using a coarse grid is good, although some numerical requirements restrict the implemented message passing to strongly synchronized communication. On the Paragon, the NX message passing library is used for the computations. Furthermore, MPI and PVM are applied and their pros and cons on this computer are described.
In addition to the basic message passing techniques for local and global communication, other possibilities are investigated. Besides the applicability of the vectorizing capability of the compiler, the influence of the I/O performance during computations is demonstrated. The scalability of the parallel application is presented for a refined discretization.} } @Article{li:bfxm, author = {Qun Li and Jie Jing and Li Xie}, title = {{BFXM}: A Parallel File System Model Based on the Mechanism of Distributed Shared Memory}, journal = {ACM Operating Systems Review}, year = {1997}, month = {October}, volume = {31}, number = {4}, pages = {30--40}, URL = {http://doi.acm.org/10.1145/271019.271025}, keywords = {parallel file system, distributed shared memory, DSM, COMA, pario-bib}, comment = {Basically, cooperative shared memory with a backing store.} } @Article{li:jmodels, author = {Zhiyong Li and Peter H. Mills and John H. Reif}, title = {Models and Resource Metrics for Parallel and Distributed Computation}, journal = {Parallel Algorithms and Applications}, year = {1996}, volume = {8}, pages = {35--59}, earlier = {li:models}, keywords = {parallel I/O algorithm, pario-bib} } @InProceedings{li:models, author = {Zhiyong Li and Peter H. Mills and John H. Reif}, title = {Models and Resource Metrics for Parallel and Distributed Computation}, booktitle = {Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences}, year = {1995}, month = {January}, pages = {51--60}, address = {Hawaii}, later = {li:jmodels}, URL = {file://ftp.cs.unc.edu/pub/projects/proteus/reports/models_hicss95.ps.gz"}, keywords = {parallel I/O algorithm, pario-bib}, abstract = {This paper presents a framework of using {\em resource metrics} to characterize the various models of parallel computation. Our framework reflects the approach of recent models to abstract architectural details into several generic parameters, which we call resource metrics. We examine the different resource metrics chosen by different parallel models, categorizing the models into four classes: the basic synchronous models, and extensions of the basic models which more accurately reflect practical machines by incorporating notions of asynchrony, communication cost and memory hierarchy. We then present a new parallel computation model, the LogP-HMM model, as an illustration of design principles based on the framework of resource metrics. The LogP-HMM model extends an existing parameterized network model (LogP) with a sequential hierarchical memory model (HMM) characterizing each processor. The result accurately captures both network communication costs and the effects of multileveled memory such as local cache and I/O. We examine the potential utility of our model in the design of near optimal sorting and FFT algorithms.} } @TechReport{li:recursive-tr, author = {Zhiyong Li and John H. Reif and Sandeep K. S. Gupta}, title = {Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms using Block-Cyclic Data Distributions}, year = {1996}, month = {March}, number = {96-04}, institution = {Dept. of Computer Science, Duke University}, later = {li:recursive}, URL = {ftp://ftp.cs.duke.edu/pub/zli/papers/TR-96-04.ps.gz}, keywords = {parallel I/O, out-of-core algorithm, pario-bib}, abstract = {In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. 
Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used {\it physical track} distribution cyclic(B_d), where B_d is the physical disk block size. \par We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem. \par We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as large as eight times for a tensor product, and the dynamic programming approach can largely reduce the number of passes to access out-of-core data for the overall tensor product formulas.} } @InProceedings{li:synthesizing, author = {Zhiyong Li and John H. Reif and Sandeep K. S. Gupta}, title = {Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms using Block-Cyclic Data Distributions}, booktitle = {Proceedings of the 1996 International Conference on Parallel Processing}, year = {1996}, month = {August}, pages = {II:142--149}, publisher = {IEEE Computer Society Press}, address = {St. Charles, IL}, earlier = {li:synthesizing-tr}, keywords = {parallel I/O algorithm, pario-bib}, abstract = {This paper presents a framework for synthesizing I/O-efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform and matrix transpositions. the programs are synthesized from tensor (Kronecker) product representations of algorithms. These programs are optimized for a striped two-level memory model where in the out-of-core data can have block-cyclic distributions on multiple disks.} } @TechReport{li:synthesizing-tr, author = {Zhiyong Li and John H. Reif and Sandeep K. S. Gupta}, title = {Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms using Block-Cyclic Data Distributions}, year = {1996}, month = {March}, number = {TR-96-04}, institution = {Dept. of Computer Science, Duke University}, later = {li:synthesizing}, URL = {ftp://ftp.cs.duke.edu/pub/zli/papers/TR-96-04.ps.gz}, keywords = {parallel I/O algorithm, pario-bib}, abstract = {In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. 
The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used {\it physical track} distribution cyclic(B_d), where B_d is the physical disk block size. \par We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem. \par We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as large as eight times for a tensor product, and the dynamic programming approach can largely reduce the number of passes to access out-of-core data for the overall tensor product formulas.} } @InProceedings{liao:overlapping, author = {Wei-keng Liao and Alok Choudhary and Kenin Coloma and Lee Ward and Eric Russell and Neil Pundit}, title = {Scalable implementations of {MPI} atomicity for concurrent overlapping {I/O}}, booktitle = {Proceedings of the 2003 International Conference on Parallel Processing}, editor = {Sadayappan, P; Yang, CS}, year = {2003}, month = {October}, pages = {239--246}, institution = {Northwestern Univ, ECE Dept, Evanston, IL 60208 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2005 The Thomson Corporation}, address = {Kaohsiung, Taiwan}, keywords = {MPI, concurrent I/O operations, overlapping write access, atomic I/O operations, pario-bib}, abstract = {For concurrent I/O operations, atomicity defines the results in the overlapping file regions simultaneously read/written by requesting processes. Atomicity has been well studied at the file system level, such as POSIX standard. In this paper, we investigate the problems arising from the implementation of MPI atomicity for concurrent overlapping write access and provide two programming solutions. Since the MPI definition of atomicity differs from the POSIX one, an implementation that simply relies on the POSIX file systems does not guarantee correct MPI semantics. To have a correct implementation of atomic I/O in MPI, we examine the efficiency of three approaches: 1) file locking, 2) graph-coloring, and 3) process-rank ordering. Performance complexity for these methods are analyzed and their experimental results are presented for file systems including NFS, SGI's XFS, and IBM's GPFS.} } @InProceedings{ligon:pfs, author = {W. B. Ligon and R. B. 
Ross}, title = {Implementation and Performance of a Parallel File System for High Performance Distributed Applications}, booktitle = {Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing}, year = {1996}, month = {August}, pages = {471--480}, publisher = {IEEE Computer Society Press}, URL = {http://ieeexplore.ieee.org:80/xpl/tocresult.jsp?isNumber=11475&page=3}, keywords = {parallel I/O, cluster computing, parallel file system, pario-bib}, abstract = {Dedicated cluster parallel computers (DCPCs) are emerging as low-cost high performance environments for many important applications in science and engineering. A significant class of applications that perform well on a DCPC are coarse-grain applications that involve large amounts of file I/O. Current research in parallel file systems for distributed systems is providing a mechanism for adapting these applications to the DCPC environment. We present the Parallel Virtual File System (PVFS), a system that provides disk striping across multiple nodes in a distributed parallel computer and file partitioning among tasks in a parallel program. PVFS is unique among similar systems in that it uses a stream-based approach that represents each file access with a single set of request parameters and decouples the number of network messages from details of the file striping and partitioning. PVFS also provides support for efficient collective file accesses and allows overlapping file partitions. We present results of early performance experiments that show PVFS achieves excellent speedups in accessing moderately sized file segments.} } @InProceedings{lin:clusterio, author = {Zheng Lin and Songnian Zhou}, title = {Parallelizing {I/O} Intensive Applications for a Workstation Cluster: a Case Study}, booktitle = {Proceedings of the IPPS~'93 Workshop on Input/Output in Parallel Computer Systems}, year = {1993}, pages = {17--36}, address = {Newport Beach, CA}, note = {Also published in Computer Architecture News 21(5), December 1993, pages 15--22}, keywords = {parallel I/O, workstation cluster, text retrieval, pario-bib}, comment = {They implement a parallel text retrieval application on a cluster of DEC~5000 workstations.} } @Article{lin:optimizing, author = {Yih-Fang Lin and Chien-Min Wang and Jan-Jan Wu}, title = {Optimizing I/O server placement for parallel I/O on switch-based irregular networks}, journal = {Lecture Notes in Computer Science}, booktitle = {2nd International Symposium on Parallel and Distributed Processing and Applications; December 13-15, 2004; Hong Kong, PEOPLES R CHINA}, editor = {Cao, J; Yang, LT; Guo, M; Lau, F}, year = {2004}, month = {November}, volume = {3358}, pages = {997--1006}, institution = {Acad Sinica, Inst Sci Informat, Taipei 115, Taiwan; Natl Taiwan Univ, Dept Comp Sci \& Informat Engn, Taipei 10764, Taiwan}, publisher = {SPRINGER-VERLAG BERLIN}, copyright = {(c)2005 The Thomson Corporation}, URL = {http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=3358&spage=997}, keywords = {I/O server placement, network topologies, switch-based clusters, pario-bib}, abstract = {In this paper, we study I/O server placement for optimizing parallel I/O performance on switch-based clusters, which typically adopt irregular network topologies to allow construction of scalable systems with incremental expansion capability. Finding optimal solution to this problem is computationally intractable. 
We quantified the number of messages travelling through each network link by a workload function, and developed three heuristic algorithms to find good solutions based on the values of the workload function. Our simulation results demonstrate performance advantage of our algorithms over a number of algorithms commonly used in existing parallel systems. In particular, the load-balance-based algorithm is superior to the other algorithms in most cases, with improvement ratios of 10\% to 95\% in terms of parallel I/O throughput.} } @InCollection{litwin:LSA, author = {Witold Litwin and Jai Menon}, title = {Scalable Distributed Log Structured Arrays}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {8}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {107--116}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, URL = {http://www.buyya.com/superstorage/}, keywords = {disk array, log-structured file system, RAID, parallel I/O, pario-bib}, comment = {Part of jin:io-book.} } @Article{liu:pario-interface, author = {X. Liu}, title = {The Performance Research of the Distributed Parallel Server System with Distributed Parallel {I/O} Interface}, journal = {Acta Electronica Sinica}, year = {2002}, volume = {30}, number = {12}, pages = {1808--1810}, publisher = {Chinese Institute of Electronics Beijing}, keywords = {parallel I/O, pario-bib} } @InProceedings{livny:stripe, author = {M. Livny and S. Khoshafian and H. Boral}, title = {Multi-Disk Management Algorithms}, booktitle = {Proceedings of the 1987 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1987}, month = {May}, pages = {69--77}, keywords = {parallel I/O, disk striping, disk array, pario-bib} } @TechReport{lo:disks, author = {Raymond Lo and Norman Matloff}, title = {A Probabilistic Limit on the Virtual Size of Replicated File Systems}, year = {1989}, institution = {Department of EE and CS, UC Davis}, keywords = {parallel I/O, replication, file system, disk mirroring, disk shadowing, pario-bib}, comment = {A look at shadowed disks. If you have $k$ disks set up to read from the disk with the shortest seek, but write to all disks, you have increased reliability, read time like the min of the seeks, and write time like the max of the seeks. It appears that with increasing $k$ you can get good performance. But this paper clearly shows, since writes move all disk heads to the same location, that the effective value of $k$ is actually quite low. Only 4--10 disks are likely to be useful for most traffic loads.} } @Article{lockey:characterization, author = {P. Lockey and R. Proctor and I. D. James}, title = {Characterization of {I/O} Requirements in a Massively Parallel Shelf Sea Model}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Fall}, volume = {12}, number = {3}, pages = {320--332}, keywords = {parallel I/O application, pario-bib}, abstract = {It is now recognized that a high level of I/O performance is crucial in making effective use of parallel machines for many scientific application codes. This paper considers the I/O requirements in one particular scientific application area; 3D modelling of continental shelf sea regions. We identify some of the scientific aims which drive the model development, and the consequent impact on the I/O needs.
As a case study we take a parallel production code running a simulation of the North Sea on a Cray~T3D platform and investigate the I/O performance in dealing with the dominant I/O component; dumping of results data to disk. In order to place the performance issues in a more general framework we construct a simple theoretical model of I/O requirements, and use this to probe the impact of available I/O performance on current and proposed scientific objectives.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @Article{long:swift-raid, author = {Darrell D. E. Long and Bruce R. Montague}, title = {{Swift/RAID}: A Distributed {RAID} System}, journal = {Computing Systems}, year = {1994}, month = {Summer}, volume = {7}, number = {3}, pages = {333--359}, keywords = {RAID, disk array, parallel I/O, distributed file system, pario-bib}, comment = {One of the features of this system is the way they develop and execute transaction plans as little scripts that are built by the client, sent to the servers, and then executed by interpreters.} } @InProceedings{loverso:sfs, author = {Susan J. LoVerso and Marshall Isman and Andy Nanopoulos and William Nesheim and Ewan D. Milne and Richard Wheeler}, title = {{\em sfs}: {A} Parallel File System for the {CM-5}}, booktitle = {Proceedings of the 1993 Summer USENIX Technical Conference}, year = {1993}, pages = {291--305}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {They took the Unix file system from SunOS and extended it to run on the CM-5. This involved handling non-power-of-two block sizes, parallel I/O calls, large file sizes, and more encouragement for extents to be allocated. The hardware is particularly suited to RAID~3 with a 16 byte striping unit, although in theory the software could do anything it wants. Geared to data-parallel model. Proc nodes (PNs) contact the timesharing daemon (TSD) on the control processor (CP), who gets block lists from the file system, which runs on one of the CPs. The TSD then arranges with the disk storage nodes (DSNs) to do the transfer directly with the PNs. Each DSN has 8~MB of buffer space, 8 disk drives, 4 SCSI busses, and a SPARC as controller. Partition managers mount non-local sfs via NFS. Performance results good. Up to 185~MB/s on 118 (2~MB/s) disks.} } @InProceedings{lumb:facade, author = {Christopher R. Lumb}, title = {Fa\c{c}ade: Virtual Storage Devices with Performance Guarantees}, booktitle = {Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies}, year = {2003}, month = {April}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast03/tech/lumb.html}, keywords = {file systems, qos, quality of service, pario-bib}, abstract = {High-end storage systems, such as those in large data centers, must service multiple independent workloads. Workloads often require predictable quality of service, despite the fact that they have to compete with other rapidly-changing workloads for access to common storage resources. We present a novel approach to providing performance guarantees in this highly-volatile scenario, in an efficient and cost-effective way. Fa\c{c}ade, a virtual store controller, sits between hosts and storage devices in the network, and throttles individual I/O requests from multiple clients so that devices do not saturate. We implemented a prototype, and evaluated it using real workloads on an enterprise storage system.
We also instantiated it to the particular case of emulating commercial disk arrays. Our results show that Fa\c{c}ade satisfies performance objectives while making efficient use of the storage resources---even in the presence of failures and bursty workloads with stringent performance requirements.} } @InProceedings{lyster:geos-das, author = {P.M. Lyster and K. Ekers and J. Guo and M. Harber and D. Lamich and J.W. Larson and R. Lucchesi and R. Rood and S. Schubert and W. Sawyer and M. Sienkiewicz and A. da Silva and J. Stobie and L.L. Takacs and R. Todling and J. Zero}, title = {Parallel Computing at the {NASA} Data Assimilation Office ({DAO})}, booktitle = {Proceedings of SC97: High Performance Networking and Computing}, year = {1997}, month = {November}, publisher = {IEEE Computer Society Press}, address = {San Jose, CA}, URL = {http://dao.gsfc.nasa.gov/DAO_people/lys/sc97/sc97/INDEX.HTML}, keywords = {parallel I/O, pario-bib}, comment = {This paper is about a NASA project GEOS-DAS (Goddard Earth Observing System-Data Assimilation System). The goal of the project is to produce ``accurate gridded datasets of atmospheric fields''. The data will be used by meteorologists for weather analysis and forecasts as well as being a tool for climate research. This paper discusses their plans to parallelize the core code of the system. They include a section on parallel I/O.} } @InProceedings{ma:buffering, author = {Xiaosong Ma and Marianne Winslett and Jonghyun Lee and Shengke Yu}, title = {Improving {MPI-IO} Output Performance with Active Buffering Plus Threads}, booktitle = {Proceedings of the International Parallel and Distributed Processing Symposium}, year = {2003}, month = {April}, publisher = {IEEE Computer Society Press}, URL = {http://drl.cs.uiuc.edu/pubs/abt.pdf}, keywords = {parallel I/O, pario-bib}, abstract = {Efficient collective output of intermediate results to secondary storage becomes more and more important for scientific simulations as the gap between processing power/interconnection bandwidth and the I/O system bandwidth enlarges. Dedicated servers can offload I/O from compute processors and shorten the execution time, but it is not always possible or easy for an application to use them. We propose the use of active buffering with threads (ABT) for overlapping I/O with computation efficiently and flexibly without dedicated I/O servers. We show that the implementation of ABT in ROMIO, a popular implementation of MPI-IO, greatly reduces the application-visible cost of ROMIO's collective write calls, and improves an application's overall performance by hiding I/O cost and saving implicit synchronization overhead from collective write operations.
Further, ABT is high-level, platform-independent, and transparent to users, giving users the benefit of overlapping I/O with other processing tasks even when the file system or parallel I/O library does not support asynchronous I/O.} } @InProceedings{ma:flexible, author = {Xiaosong Ma and Xiangmin Jiao and Michael Campbell and Marianne Winslett}, title = {Flexible and Efficient Parallel {I/O} for Large-Scale Multi-component Simulations}, booktitle = {Proceedings of the Fourth Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications}, year = {2003}, month = {April}, publisher = {IEEE Computer Society Press}, URL = {http://drl.cs.uiuc.edu/pubs/pdseca.html}, keywords = {parallel I/O, pario-bib}, abstract = {In this paper, we discuss our experience of providing high performance parallel I/O for a large-scale, on-going, multi-disciplinary simulation project for solid propellant rockets. We describe the performance and data management issues observed in this project and present our solutions, including (1) support for relatively fine-grained distribution of irregular datasets in parallel I/O, (2) a flexible data management facility for inter-module communication, and (3) two schemes to overlap computation with I/O. Performance results obtained from the rocket simulation's development and production platforms show that our I/O optimizations can dramatically reduce the simulation's visible I/O cost, as well as the number of disk files, and significantly improve the overall performance. Meanwhile, our data management facility helps to provide simulation developers with simple user interfaces for parallel I/O.} } @InProceedings{mache:spatial, author = {Jens Mache and Virginia Lo and Marilynn Livingston and Sharad Garg}, title = {The Impact of Spatial Layout of Jobs on Parallel {I/O} Performance}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {45--56}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Mache.ps}, keywords = {parallel I/O, pario-bib}, abstract = {Input/Output is a big obstacle to effective use of teraflops-scale computing systems. Motivated by earlier parallel I/O measurements on an Intel TFLOPS machine, we conduct studies to determine the sensitivity of parallel I/O performance on multi-programmed mesh-connected machines with respect to number of I/O nodes, number of compute nodes, network link bandwidth, I/O node bandwidth, spatial layout of jobs, and read or write demands of applications. \par Our extensive simulations and analytical modeling yield important insights into the limitations on parallel I/O performance due to network contention, and into the possible gains in parallel I/O performance that can be achieved by tuning the spatial layout of jobs. \par Applying these results, we devise a new processor allocation strategy that is sensitive to parallel I/O traffic and the resulting network contention. In performance evaluations driven by synthetic workloads and by a real workload trace captured at the San Diego Supercomputing Center, the new strategy improves the average response time of parallel I/O intensive jobs by up to a factor of 4.5.} } @InProceedings{maciel:dgw, author = {Frederico B.
Maciel and Nobutoshi Sagawa and Teruo Tanaka}, title = {{Dynamic Gateways}: A Novel Approach to Improve Networking Performance and Availability on Parallel Servers}, booktitle = {Proceedings of the High-Performance Computing and Networking Symposium (HPCN'98)}, year = {1998}, pages = {678--687}, URL = {http://www2.neweb.ne.jp/wd/fbm}, keywords = {parallel networking, network I/O, parallel I/O, pario-bib}, abstract = {Parallel servers realize scalability and availability by effectively using multiple hardware resources (i.e., nodes and disks). Scalability is improved by distributing processes and data onto multiple resources; and availability is maintained by substituting a failed resource with a spare one. {\em Dynamic Gateways\/} extends these features to networking, by balancing the traffic among multiple connections to the network in order to improve scalability, and detours traffic around failed resources to maintain availability. This is made transparent to the clients and to applications in the server by using proxy and gratuitous ARP to control the network traffic. A performance evaluation shows that Dynamic Gateways improves the scalability (allowing the maximum networking performance to increase with increasing number of connections) and the performance (improving throughput and reducing access latency).}, comment = {Contact fred-m@crl.hitachi.co.jp, sagawa@crl.hitachi.co.jp, or tetanaka@kanagawa.hitachi.co.jp.} } @Article{mackay:groundwater, author = {David Mackay and G. Mahinthakumar and Ed D'Azevedo}, title = {A Study of {I/O} in a Parallel Finite Element Groundwater Transport Code}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Fall}, volume = {12}, number = {3}, pages = {307--319}, keywords = {parallel I/O application, pario-bib}, abstract = {A parallel finite element groundwater transport code is used to compare three different strategies for performing parallel I/O: (1) have a single processor collect data and perform sequential I/O in large blocks, (2) use variations of vendor specific I/O extensions, and (3) use the EDONIO I/O library. Each processor performs many writes of one to four kilobytes to reorganize local data in a global shared file. Our findings suggest having a single processor collect data and perform large block-contiguous operations may be quite efficient and portable for up to 32 processor configurations. This approach does not scale well for a larger number of processors since the single processor becomes a bottleneck for gathering data. The effective application I/O rate observed, which includes times for opening and closing files, is only a fraction of the peak device read/write rates. Some form of data redistribution and buffering in remote memory as performed in EDONIO may yield significant improvements for non-contiguous data I/O access patterns and short requests. Implementors of parallel I/O systems may consider some form of buffering as performed in EDONIO to speed up such I/O requirements.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @InProceedings{madhyastha:adaptive, author = {Tara M. Madhyastha and Daniel A.
Reed}, title = {Intelligent, Adaptive File System Policy Selection}, booktitle = {Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation}, year = {1996}, month = {October}, pages = {172--179}, publisher = {IEEE Computer Society Press}, later = {madhyastha:thesis}, keywords = {parallel I/O, pario-bib}, abstract = {Traditionally, maximizing input/output performance has required tailoring application input/output patterns to the idiosyncrasies of specific input/output systems. The authors show that one can achieve high application input/output performance via a low overhead input/output system that automatically recognizes file access patterns and adaptively modifies system policies to match application requirements. This approach reduces the application developer's input/output optimization effort by isolating input/output optimization decisions within a retargetable file system infrastructure. To validate these claims, they have built a lightweight file system policy testbed that uses a trained learning mechanism to recognize access patterns. The file system then uses these access pattern classifications to select appropriate caching strategies, dynamically adapting file system policies to changing input/output demands throughout application execution. The experimental data show dramatic speedups on both benchmarks and input/output intensive scientific applications.}, comment = {See also madhyastha:thesis, and related papers.} } @InProceedings{madhyastha:classification, author = {Tara M. Madhyastha and Daniel A. Reed}, title = {Input/Output Access Pattern Classification Using Hidden {Markov} Models}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {57--67}, publisher = {ACM Press}, address = {San Jose, CA}, later = {madhyastha:thesis}, URL = {http://doi.acm.org/10.1145/266220.266226}, keywords = {workload characterization, file access pattern, parallel I/O, pario-bib}, abstract = {Input/output performance on current parallel file systems is sensitive to a good match of application access pattern to file system capabilities. Automatic input/output access classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper we examine a new method for access pattern classification that uses hidden Markov models, trained on access patterns from previous executions, to create a probabilistic model of input/output accesses. We compare this approach to a neural network classification framework, presenting performance results from parallel and sequential benchmarks and applications.}, comment = {The most interesting thing in this paper is the use of a Hidden Markov Model to understand the access pattern of an application to a file. After running the application on the file once, and simultaneously training their HMM, they use the result to tune the system for the next execution (cache size, cache partitioning, prefetching, Intel file mode, etc). They get much better performance in future runs. See also madhyastha:thesis, and related papers.} } @InProceedings{madhyastha:global, author = {Tara M. Madhyastha and Daniel A. 
Reed}, title = {Exploiting Global Input/Output Access Pattern Classification}, booktitle = {Proceedings of SC97: High Performance Networking and Computing}, year = {1997}, month = {November}, publisher = {ACM Press}, address = {San Jose}, later = {madhyastha:thesis}, URL = {http://www.supercomp.org/sc97/proceedings/TECH/MADHYAST/INDEX.HTM}, keywords = {file access pattern, parallel I/O, pario-bib}, abstract = {Parallel input/output systems attempt to alleviate the performance bottleneck that affects many input/output intensive applications. In such systems, an understanding of the application access pattern, especially how requests from multiple processors for different file regions are logically related, is important for optimizing file system performance. We propose a method for automatically classifying these global access patterns and using these global classifications to select and tune file system policies to improve input/output performance. We demonstrate this approach on benchmarks and scientific applications using global classification to automatically select appropriate underlying Intel PFS input/output modes and server buffering strategies.}, comment = {No page numbers: web and CDROM proceedings only. See also madhyastha:thesis and related papers.} } @InProceedings{madhyastha:informed, author = {Tara M. Madhyastha and Garth A. Gibson and Christos Faloutsos}, title = {Informed Prefetching of Collective Input/Output Requests}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/madhyast.pdf}, keywords = {informed prefetching, disk-directed I/O, parallel I/O, pario-bib}, comment = {They argue that if enough application prefetches are made, a standard Unix interface will provide the same performance as a collective I/O interface. She uses simulation to show that if the file ordering is preserved, then the prefetch depth (the number of advance requests) is bounded by the number of disk drives. They look at two global access patterns: a simple interleaved sequential pattern and a 3-D block decomposition. Their experiment used 8 procs and 8 disks and did a comparison of the prefetching techniques to disk-directed I/O. Empirical studies showed that they needed a prefetch horizon of one to two times the number of disks to match the performance of disk-directed I/O, but the prefetching techniques require more memory.} } @Article{madhyastha:learning, author = {Tara M. Madhyastha and Daniel A. Reed}, title = {Learning to Classify Parallel Input/Output Access Patterns}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2002}, month = {August}, volume = {13}, number = {8}, pages = {802--813}, URL = {http://computer.org/tpds/td2002/l0802abs.htm}, keywords = {parallel I/O, file access pattern, pario-bib}, abstract = {Input/output performance on current parallel file systems is sensitive to a good match of application access patterns to file system capabilities. Automatic input/output access pattern classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper, we examine and compare two novel input/output access pattern classification methods based on learning algorithms. The first approach uses a feedforward neural network previously trained on access pattern benchmarks to generate qualitative classifications.
The second approach uses hidden Markov models trained on access patterns from previous executions to create a probabilistic model of input/output accesses. In a parallel application, access patterns can be recognized at the level of each local thread or as the global interleaving of all application threads. Classification of patterns at both levels is important for parallel file system performance; we propose a method for forming global classifications from local classifications. We present results from parallel and sequential benchmarks and applications that demonstrate the viability of this approach.} } @InProceedings{madhyastha:optimizing, author = {Tara M. Madhyastha and Christopher L. Elford and Daniel A. Reed}, title = {Optimizing Input/Output Using Adaptive File System Policies}, booktitle = {Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1996}, month = {September}, pages = {II:493--514}, later = {madhyastha:thesis}, keywords = {multiprocessor file system, prefetching, caching, parallel I/O, multiprocessor file system interface, pario-bib}, comment = {See also madhyastha:thesis, and related papers.} } @PhdThesis{madhyastha:thesis, author = {Tara Madhyastha}, title = {Automatic Classification of Input/Output Access Patterns}, year = {1997}, month = {August}, school = {University of Illinois, Urbana-Champaign}, URL = {http://www.cs.uiuc.edu/Dienst/UI/2.0/Describe/ncstrl.uiuc_cs/UIUCDCS-R-97-2021}, keywords = {parallel I/O, file access pattern, pario-bib}, comment = {See also madhyastha:classification, madhyastha:global, madhyastha:adaptive, madhyastha:optimizing.} } @InProceedings{magoutis:direct, author = {Kostas Magoutis and Salimah Addetia and Alexandra Fedorova and Margo I. Seltzer}, title = {Making the Most Out of Direct-Access Network Attached Storage}, booktitle = {Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies}, year = {2003}, month = {April}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast03/tech/magoutis.html}, keywords = {file systems, rpc optimizations, rdma, multi-client workload, small I/O, pario-bib}, abstract = {The performance of high-speed network-attached storage applications is often limited by end-system overhead, caused primarily by memory copying and network protocol processing. In this paper, we examine alternative strategies for reducing overhead in such systems. We consider optimizations to remote procedure call (RPC)-based data transfer using either remote direct memory access (RDMA) or network interface support for pre-posting of application receive buffers. We demonstrate that both mechanisms enable file access throughput that saturates a 2Gb/s network link when performing large I/Os on relatively slow, commodity PCs. However, for multi-client workloads dominated by small I/Os, throughput is limited by the per-I/O overhead of processing RPCs in the server. For such workloads, we propose the use of a new network I/O mechanism, Optimistic RDMA (ORDMA). ORDMA is an alternative to RPC that aims to improve server throughput and response time for small I/Os. We measured performance improvements of up to 32\% in server throughput and 36\% in response time with use of ORDMA in our prototype.} } @InProceedings{majumdar:characterize, author = {S. 
Majumdar and Yiu Ming Leung}, title = {Characterization of applications with {I/O} for processor scheduling in multiprogrammed parallel systems}, booktitle = {Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing}, year = {1994}, pages = {298--307}, publisher = {IEEE Computer Society Press}, keywords = {workload characterization, scheduling, parallel I/O, pario-bib}, abstract = {Most studies of processor scheduling in multiprogrammed parallel systems have ignored the I/O performed by applications. Recent studies have demonstrated that significant I/O operations are performed by a number of different classes of parallel applications. This paper focuses on some basic issues that underlie scheduling in multiprogrammed parallel environments running applications with I/O. Characterization of the I/O behavior of parallel applications is discussed first. Based on simulation models this research investigates the influence of these I/O characteristics on processor scheduling.} } @InProceedings{majumdar:management, author = {Shikharesh Majumdar and Faisal Shad}, title = {Characterization and Management of {I/O} on Multiprogrammed Parallel Systems}, booktitle = {Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing}, year = {1995}, month = {October}, pages = {502--510}, publisher = {IEEE Computer Society Press}, address = {San Antonio, TX}, keywords = {workload characterization, parallel I/O, pario-bib}, comment = {Analytical workload model. Simulation studies. See also kwong:distribution.} } @InProceedings{malluhi:pss, author = {Qutaibah Malluhi and William E. Johnston}, title = {Approaches for a Reliable High-Performance Distributed-Parallel Storage System}, booktitle = {Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing}, year = {1996}, month = {August}, pages = {500--509}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, pario-bib}, abstract = {The paper studies different schemes to enhance the reliability, availability and security of a high performance distributed storage system. We have previously designed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high speed wide area network to achieve scalability and high data throughput. The general approach of the paper employs erasure error correcting codes to add data redundancy that can be used to retrieve missing information caused by hardware, software, or human faults. The paper suggests techniques for reducing the communication and computation overhead incurred while retrieving missing data blocks from redundant information. These techniques include clustering, multidimensional coding, and the full two dimensional parity scheme.} } @Article{manuel:logjam, author = {Tom Manuel}, title = {Breaking the Data-rate Logjam with arrays of small disk drives}, journal = {Electronics}, year = {1989}, month = {February}, volume = {62}, number = {2}, pages = {97--100}, keywords = {parallel I/O, disk array, I/O bottleneck, pario-bib}, comment = {See also Electronics, Nov. 88 p 24, Dec. 88 p 112. Trade journal short on disk arrays. Very good intro. No new technical content. Concentrates on RAID project. Lists several commercial versions. Mostly concentrates on single-controller versions.} } @Article{marco:raid1, author = {R. Marco and J. Marco and D. Rodriguez and D. Cano and I.
Cabrillo}, title = {{RAID-1} and data stripping across the {GRID}}, journal = {Lecture Notes in Computer Science}, booktitle = {1st European Across Grids Conference; February 13-14, 2003; Santiago de Compostela, SPAIN}, editor = {Rivera, FF; Bubak, M; Tato, AG; Doallo, R}, year = {2004}, month = {March}, volume = {2970}, pages = {119--123}, institution = {Univ Cantabria, CSIC, Inst Fis Cantabria, Avda Los Castros S-N, E-39005 Santander, Spain; Univ Cantabria, CSIC, Inst Fis Cantabria, E-39005 Santander, Spain}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=2970&spage=119}, keywords = {RAID, RAID-1, data striping, GRID, pario-bib}, abstract = {Stripping techniques combined with an adequate replication policy across the Grid offer the possibility to improve significatively data access and processing times, while eliminating the need for local data mirroring, so saving significatively on storage costs. First results on a local cluster following a simple strategy are presented.} } @Misc{maspar:pario, key = {Mas}, title = {Parallel File {I/O} Routines}, year = {1992}, howpublished = {MasPar Computer Corporation}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {Man pages for MasPar file system interface. They have either a single shared file pointer, after which all processors read or write in an interleaved pattern, or individual (plural) file pointer, allowing arbitrary access patterns. Updated in 1992 with many more features.} } @Article{masters:pario, author = {Del Masters}, title = {Improve Disk Subsystem Performance with Multiple Serial Drives in Parallel}, journal = {Computer Technology Review}, year = {1987}, month = {July}, volume = {7}, number = {9}, pages = {76--77}, keywords = {parallel I/O, pario-bib}, comment = {Information about the early Maximum Strategy disk array, which striped over 4 disk drives, apparently synchronously.} } @Article{matloff:multidisk, author = {Norman S. Matloff}, title = {A Multiple-Disk System for both Fault Tolerance and Improved Performance}, journal = {IEEE Transactions on Reliability}, year = {1987}, month = {June}, volume = {R-36}, number = {2}, pages = {199--201}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, reliability, disk shadowing, disk mirroring, pario-bib}, comment = {Variation on mirrored disks using more than 2 disks, to spread the files around. Good performance increases.} } @InProceedings{matthews:hippi, author = {Kevin C. Matthews}, title = {Experiences Implementing a Shared File System on a {HIPPI} Disk Array}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {77--88}, publisher = {IEEE Computer Society Press}, keywords = {mass storage, distributed file system, parallel I/O, pario-bib}, abstract = {Shared file systems which use a physically shared mass storage device have existed for many years, although not on UNIX based operating systems. This paper describes a shared file system (SFS) that was implemented first as a special project on the Cray Research Inc. (CRI) UNICOS operating system. A more general product was then built on top of this project using a HIPPI disk array for the shared mass storage. The design of SFS is outlined, as well as some performance experiences with the product.
We describe how SFS interacts with the OSF distributed file service (DFS) and with the CRI data migration facility (DMF). We also describe possible development directions for the SFS product.}, comment = {They use hardware to tie the same storage device (a disk array) to several computers (Cray C90s). They build a custom piece of hardware just to service semaphore requests very fast. HIPPI is the interconnect. Details have a lot to do with the synchronization between processors trying to update the same metadata; that's why they use the semaphores.} } @InProceedings{matthijs:framework, author = {F. Matthijs and Y. Berbers and P. Verbaeten}, title = {A flexible {I/O} framework for parallel and distributed systems}, booktitle = {Proceedings of the Fifth International Workshop on Object Orientation in Operating Systems}, year = {1995}, pages = {187--190}, publisher = {IEEE Computer Society Press}, keywords = {input-output programs, object-oriented, parallel systems; I/O performance, migration, dynamic load balancing, fault tolerance, parallel I/O, pario-bib}, abstract = {We propose a framework for I/O in parallel and distributed systems. The framework is highly customizable and extendible, and enables programmers to offer high level objects in their applications, without requiring them to struggle with the low level and sometimes complex details of high performance distributed I/O. Also, the framework exploits application specific information to improve I/O performance by allowing specialized programmers to customize the framework. Internally, we use indirection and granularity control to support migration, dynamic load balancing, fault tolerance, etc. for objects of the I/O system, including those representing application data.} } @InProceedings{mayr:query, author = {Tobias Mayr and Philippe Bonnet and Johannes Gehrke and Praveen Seshadri}, title = {Leveraging Non-Uniform Resources for Parallel Query Processing}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {120--129}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190120abs.htm}, keywords = {parallel query processing, load balancing, parallel I/O, pario-bib}, abstract = {Modular clusters are now composed of non-uniform nodes with different CPUs, disks or network cards so that customers can adapt the cluster configuration to the changing technologies and to their changing needs. This challenges dataflow parallelism as the primary load balancing technique of existing parallel database systems. We show in this paper that dataflow parallelism alone is ill suited for modular clusters because running the same operation on different subsets of the data can not fully utilize non-uniform hardware resources. We propose and evaluate new load balancing techniques that blend pipeline parallelism with data parallelism. We consider relational operators as pipelines of fine-grained operations that can be located on different cluster nodes and executed in parallel on different data subsets to best exploit non-uniform resources. We present an experimental study that confirms the feasibility and effectiveness of the new techniques in a parallel execution engine prototype based on the open-source DBMS Predator.} } @InProceedings{mcmurdy:unstripe, author = {Ronald K.
McMurdy and Badrinath Roysam}, title = {Improving {RAID-5} Performance by Un-striping Moderate-sized Files}, booktitle = {Proceedings of the 1993 International Conference on Parallel Processing}, year = {1993}, pages = {II--279--282}, publisher = {CRC Press}, address = {St. Charles, IL}, keywords = {parallel I/O, disk array, pario-bib, RAID}, comment = {Allocate small- and medium-sized files entirely on one disk rather than striped, to cut seek and rotation latency that would happen if they were spread across many disks.} } @InProceedings{meador:array, author = {Wes E. Meador}, title = {Disk Array Systems}, booktitle = {Proceedings of IEEE Compcon}, year = {1989}, month = {Spring}, pages = {143--146}, keywords = {parallel I/O, disk array, disk striping, pario-bib}, comment = {Describes {\em Strategy 2 Disk Array Controller}, which allows 4 or 8 drives, hardware striped, with parity drive and 0-4 hot spares. Up to 4 channels to cpu(s). Logical block interface. Defects, errors, formatting, drive failures all handled automatically. Peak 40 MB/s data transfer on each channel.} } @Misc{meiko:cs2, key = {Meiko}, title = {Computing Surface {CS-2}: Technical Overview}, year = {1993}, howpublished = {Meiko brochure S1002-10M115.01A}, keywords = {multiprocessor architecture, parallel I/O, pario-bib}, comment = {Three node types: 4 SPARC (50 MHz), 1 SPARC + two Fujitsu vector procs, or 1 SPARC + 3 I/O ports. All have a special communications processor that supports remote memory access. Each has 128 MBytes in 16 banks. Memory-memory transfer operations using ``remote DMA'', supported by the communications processor. User-level comm interface, with protection. Uses multistage network with 8x8 crossbar switches, looks like a fat tree. S/BUS, separate from the memory bus, is used for I/O, either directly, or through 2 SCSI and 1 ethernet. Control and diagnostic networks. Parallel file system stripes across multiple partitions. Can use RAID. Communications processor has its own MMU; control registers are mapped to user space. Network-wide virtual addresses can support shared memory? Remote store, atomic operations, global operations. Comm proc can support I/O threads -- but can it talk to the disks? OS based on Solaris 2, plus global shared memory, parallel file system, and capability-based protection. Machine is logically partitioned into login, devices, and parallel computation.} } @InProceedings{memik:patterns, author = {Gokhan Memik and Mahmut Kandemir and Alok Choudhary}, title = {Exploiting Inter-File Access Patterns Using Multi-Collective {I/O}}, booktitle = {Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies}, year = {2002}, month = {January}, pages = {245--258}, publisher = {USENIX Association}, address = {Monterey, CA}, URL = {http://www.usenix.org/publications/library/proceedings/fast02/memik.html}, keywords = {file systems, pario-bib}, abstract = {This paper introduces a new concept called Multi-Collective I/O (MCIO) that extends conventional collective I/O to optimize I/O accesses to multiple arrays simultaneously. In this approach, as in collective I/O, multiple processors co-ordinate to perform I/O on behalf of each other if doing so improves overall I/O time. However, unlike collective I/O, MCIO considers multiple arrays simultaneously; that is, it has a more global view of the overall I/O behavior exhibited by application. 
This paper shows that determining optimal MCIO access pattern is an NP-complete problem, and proposes two different heuristics for the access pattern detection problem (also called the assignment problem). Both of the heuristics have been implemented within a runtime library, and tested using a large-scale scientific application. Our preliminary results show that MCIO outperforms collective I/O by as much as 87\%. Our runtime library-based implementation can be used by users as well as optimizing compilers. Based on our results, we recommend future library designers for I/O-intensive applications to include MCIO in their suite of optimizations.} } @InProceedings{menasce:mass, author = {Daniel Menasc\'e and Odysseas Ioannis Pentakalos and Yelena Yesha}, title = {An Analytic Model of Hierarchical Mass Storage Systems With Network-Attached Storage Devices}, booktitle = {Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1996}, month = {May}, pages = {180--189}, publisher = {ACM Press}, address = {Philadelphia, PA}, keywords = {network attached peripherals, analytic model, mass storage, parallel I/O, pario-bib} } @InCollection{menon:bcompare, author = {Jai Menon}, title = {A Performance Comparison of {RAID-5} and Log-Structured Arrays}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {4}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {55--64}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {menon:compare}, URL = {http://www.buyya.com/superstorage/}, keywords = {RAID, disk array, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of menon:compare.} } @InCollection{menon:bsparing, author = {Jai Menon}, title = {Comparison of Sparing Alternatives for Disk Arrays}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {9}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {117--128}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {menon:sparing}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, RAID, disk array, pario-bib}, comment = {Part of jin:io-book; reformatted version of menon:sparing.} } @InProceedings{menon:compare, author = {Jai Menon}, title = {A Performance Comparison of {RAID-5} and Log-structured Arrays}, booktitle = {Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing}, year = {1995}, month = {August}, pages = {167--178}, later = {menon:bcompare}, keywords = {RAID, disk array, parallel I/O, pario-bib}, comment = {He compares a RAID-5 disk array with a log-structured array (LSA). An LSA is essentially an implementation of a log-structured file system inside a disk controller. The disk controller buffers up writes in a non-volatile cache; when the outgoing data buffer is full, it is written to some large contiguous region of the disk. The controller manages a directory to keep track of the various segment locations, and does garbage collection (cleaning). They can insert a compression algorithm in front of the cache so that they get better cache and disk utilization by storing data in compressed form.
For fair comparison they compare with a similar feature in the plain RAID5 array.} } @Article{menon:daisy, author = {Jai Menon and Kent Treiber}, title = {{Daisy}: Virtual-disk Hierarchical Storage Manager}, journal = {ACM SIGMETRICS Performance Evaluation Review}, year = {1997}, month = {December}, volume = {25}, number = {3}, pages = {37--44}, URL = {http://doi.acm.org/10.1145/270900.270908}, keywords = {hierarchical storage, tape storage, tertiary storage, tape robot, parallel I/O, pario-bib}, comment = {Part of a special issue on parallel and distributed I/O.} } @InProceedings{menon:sparing, author = {Jai Menon and Dick Mattson}, title = {A Comparison of Sparing Alternatives for Disk Arrays}, booktitle = {Proceedings of the 19th Annual International Symposium on Computer Architecture}, year = {1992}, pages = {318--329}, publisher = {ACM Press}, later = {menon:bsparing}, URL = {http://portal.acm.org/citation.cfm?id=139669.140392}, keywords = {parallel I/O, RAID, disk array, pario-bib}, abstract = {This paper explores how choice of sparing methods impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads with random single block reads and writes, array performance is compared in four different modes - normal mode (no disks have failed), degraded mode (a disk has failed and its data has not been reconstructed), rebuild mode (a disk has failed and its data is being reconstructed), and copyback mode (which is needed for distributed sparing and parity sparing when failed disks are replaced with new disks). Attention is concentrated on small disk subsystems (fewer than 32 disks) where choice of sparing method has significant impact on array performance, rather than large disk subsystems (64 or more disks). It is concluded that, for disk subsystems with a small number of disks, distributed sparing offers major advantages over dedicated sparing in normal, degraded and rebuild modes of operation, even if one has to pay a copyback penalty. Furthermore, it is better than parity sparing in rebuild mode and similar to it in other operating modes, making it the sparing method of choice.} } @Article{menor:grid-io, author = {Jos\'{e} M. P\'{e}rez Menor and F\'{e}lix Garc\'{\i}a and Jes\'{u}s Carretero and Alejandro Calder\'{o}n and Javier Fern\'{a}ndez and Jos\'{e} Daniel Garc\'{\i}a}, title = {A parallel {I/O} middleware to integrate heterogeneous storage resources on grids}, journal = {Lecture Notes in Computer Science}, booktitle = {1st European Across Grids Conference; February 13-14, 2003; Santiago de Compostela, SPAIN}, editor = {Rivera, FF; Bubak, M; Tato, AG; Doallo, R}, year = {2004}, month = {March}, volume = {2970}, pages = {124--131}, institution = {Univ Carlos III Madrid, Comp Architecture Grp, Dept Comp Sci, Madrid, Spain}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2970&spage=124}, keywords = {data grids, parallel I/O, data declustering, pario-bib}, abstract = {The philosophy behind grid is to use idle resources to achieve a higher level of computational services (computation, storage, etc).
Existing data grids solutions are based in new servers, specific APIs and protocols, however this approach is not a realistic solution for enterprises and universities, because this supposes the deployment of new data servers across the company. This paper describes a new approach to data access in computational grids. This approach is called GridExpand, a parallel I/O middleware that integrates heterogeneous data storage resources in grids. The proposed grid solution integrates available data network solutions (NFS, CIFS, WebDAV) and makes possible the access to a global grid file system. Our solution differs from others because it does not need the installation of new data servers with new protocols. Most of the data grid solutions use replication as the way to obtain high performance. Replication, however, introduce consistency problem for many collaborative applications, and sometimes requires the usage of lots of resources. To obtain high performance, we apply the parallel I/O techniques used in parallel file systems.} } @Article{merchant:striping, author = {Arif Merchant and Philip S. Yu}, title = {Analytic Modeling and Comparisons of Striping Strategies of Replicated Disk Arrays}, journal = {IEEE Transactions on Computers}, year = {1995}, month = {March}, volume = {44}, number = {3}, pages = {419--431}, publisher = {IEEE Computer Society Press}, keywords = {disk striping, disk array, RAID, parallel I/O, pario-bib} } @InProceedings{merriam:triangle, author = {Marshal L. Merriam}, title = {Parallel Implementation of an Algorithm for {Delaunay} Triangulation}, booktitle = {Proceedings of Computational Fluid Dynamics}, year = {1992}, volume = {2}, pages = {907--912}, keywords = {parallel I/O, file system workload, pario-bib}, comment = {This application runs on the NASA Ames iPSC/860. This application has some I/O: reading in the input file, which is a set of x,y,z data points. I/O was really slow if formatted (ie, ASCII instead of binary) or sequential instead of parallel. Any input record could go to any processor; the first step in the algorithm (after the points are read in) is essentially a kind of sort to move points around to localize points and balance load.} } @Article{messerli:jimage, author = {Vincent Messerli and Oscar Figueiredo and B. Gennart and Roger D. Hersch}, title = {Parallelizing {I/O}-intensive image access and processing applications}, journal = {IEEE Concurrency}, year = {1999}, volume = {7}, number = {2}, pages = {28--37}, URL = {http://diwww.epfl.ch/w3lsp/publications/gigaserver/piiiaapa.pdf}, keywords = {applications, image processing, pario-app, parallel I/O, pario-bib}, abstract = {This article presents methods and tools for building parallel applications based on commodity components: PCs, SCSI disks, Fast Ethernet, Windows NT. Chief among these tools is CAP, our computer-aided parallelization tool. CAP generates highly pipelined applications that run communication and I/O operations in parallel with processing operations.
One of CAP's successes is the Visible Human Slice Server, a 3D tomographic image server that allows clients to choose and view any cross section of the human body.} } @PhdThesis{messerli:thesis, author = {Vincent Messerli}, title = {Tools for Parallel {I/O} and Compute Intensive Applications}, year = {1999}, school = {\'Ecole Polytechnique F\'ed\'erale de Lausanne}, note = {Th\`ese 1915}, keywords = {parallel computing, image processing, parallel I/O application, parallel I/O, pario-bib}, comment = {The complete description of PS$^2$ and its use with CAP, a parallelization tool, for data-flow-like support of parallel I/O. Nice work. See also messerli:jimage, gennart:CAP, vetsch:visiblehuman, messerli:tomographic.} } @InProceedings{messerli:tomographic, author = {V. Messerli and B. Gennart and R.~D. Hersch}, title = {Performances of the {PS$^2$} Parallel Storage and Processing System for Tomographic Image Visualization}, booktitle = {Proceedings of the Seventeenth International Conference on Distributed Computer Systems}, year = {1997}, month = {December}, pages = {514--522}, publisher = {IEEE Computer Society Press}, address = {Seoul, Korea}, URL = {http://diwww.epfl.ch/w3lsp/pub/publications/ps2/potppsapsftiv.html}, keywords = {parallel computing, parallel I/O, parallel I/O application, image processing, pario-bib}, abstract = {We propose a new approach for developing parallel I/O- and compute-intensive applications. At a high level of abstraction, a macro data flow description describes how processing and disk access operations are combined. This high-level description (CAP) is precompiled into compilable and executable C++ source language. Parallel file system components specified by CAP are offered as reusable CAP operations. Low-level parallel file system components can, thanks to the CAP formalism, be combined with processing operations in order to yield efficient pipelined parallel I/O and compute intensive programs. The underlying parallel system is based on commodity components (PentiumPro processors, Fast Ethernet) and runs on top of WindowsNT. The CAP-based parallel program development approach is applied to the development of an I/O and processing intensive tomographic 3D image visualization application. Configurations range from a single PentiumPro 1-disk system to a four PentiumPro 27-disk system. We show that performances scale well when increasing the number of processors and disks. With the largest configuration, the system is able to extract in parallel and project into the display space between three and four 512x512 images per second. The images may have any orientation and are extracted from a 100 MByte 3D tomographic image striped over the available set of disks.}, comment = {See also messerli:jimage, gennart:CAP, vetsch:visiblehuman, messerli:thesis.} } @Article{michael:future, author = {Gavin Michael and Andrew Chien}, title = {Future Multicomputers: Beyond Minimalist Multiprocessors?}, journal = {Computer Architecture News}, year = {1992}, month = {December}, volume = {20}, number = {5}, pages = {6--12}, keywords = {multiprocessor architecture, compiler, parallel I/O, pario-bib}, comment = {Includes some comments by Randy Katz about parallel I/O, in particular, distinguishing between ``fat'' nodes (with many disks, e.g., a RAID), and ``thin'' nodes (with one disk).} } @TechReport{milenkovic:model, author = {Milan Milenkovi\'c}, title = {A Model for Multiprocessor {I/O}}, year = {1989}, month = {July}, number = {89-CSE-30}, institution = {Dept.
of Computer Science and Engineering, Southern Methodist University}, keywords = {multiprocessor I/O, I/O architecture, distributed system, pario-bib}, comment = {Advocates using dedicated server processors for all I/O, e.g., disk server, terminal server, network server. Pass I/O requests and data via messages or RPC calls over the interconnect (here a shared bus). Server handles packaging, blocking, caching, errors, interrupts, and so forth, freeing the main processors and the interconnect from all this activity. Benefits: encapsulates I/O-related stuff in specific places, accommodates heterogeneity, improves performance. Nice idea, but allows for an I/O bottleneck, unless server can handle all the demand. Otherwise would need multiple servers, more expensive than just multiple controllers.} } @InProceedings{miller:iobehave, author = {Ethan L. Miller and Randy H. Katz}, title = {Input/Output Behavior of Supercomputer Applications}, booktitle = {Proceedings of Supercomputing '91}, year = {1991}, month = {November}, pages = {567--576}, publisher = {IEEE Computer Society Press}, address = {Albuquerque, NM}, keywords = {file access pattern, supercomputer, disk caching, prefetching, pario-bib}, comment = {Same as miller:iobehave-tr except without the appendix outlining trace format. Included in pario-bibliography not because it measures a parallel workload, but because it is so often cited in the parallel-IO community.} } @Article{miller:jrama, author = {Ethan L. Miller and Randy H. Katz}, title = {{RAMA}: An Easy-To-Use, High-Performance Parallel File System}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4-5}, pages = {419--446}, publisher = {North-Holland (Elsevier Scientific)}, earlier = {miller:rama2}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00008-2}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {Modern massively parallel file systems provide high bandwidth file access by striping files across arrays of disks attached to a few specialized I/O nodes. However, these file systems are hard to use and difficult to integrate with workstations and tertiary storage. RAMA addresses these problems by providing a high-performance massively parallel file system with a simple interface. RAMA uses hashing to pseudo-randomly distribute data to all of its disks, insuring high bandwidth regardless of access pattern and eliminating bottlenecks in file block accesses. This flexibility does not cause a large loss of performance - RAMA's simulated performance is within 10-15\% of the optimum performance of a similarly-sized striped file system, and is a factor of 4 or more better than a striped file system with poorly laid out data.}, comment = {They use parallel disks of a multiprocessor as a set-associative cache for tertiary storage. Each "disk line" contains a set of blocks, and a little index that lists the blocks contained in the disk line. To access block b of a file, you hash on b/s, where s is a small factor like 4; that encourages consecutive blocks to land in the same disk line, for better locality. That gives you the disk line number. From that you compute the disk number, and the node number. Send a message to that node. It reads through the index for that disk line to find the block within the line. Metadata like file permissions are stored in the disk line with the first block of the file. Part of the paper deals with file-system integrity; no fsck is needed. 
When RAMA goes to tertiary storage, it reads a large batch of the file, but need not read the entire file into disk cache. Dirty data are flushed back to tertiary store periodically. \par They use simulation to study performance with synthetic access patterns. Unfortunately they simulated rather small files and patterns. The paper talks quite a bit about disk (space and bandwidth) utilization, and network bandwidth utilization. One of the big benefits of this hash-based approach is that it tends to distribute the traffic to the network and to the disks very evenly, even under highly regular access patterns that might unbalance a traditional striped approach. Finally, they claim to do well on small-file workloads as well as supercomputer workloads.} } @Article{miller:pario, author = {L. L. Miller and A. R. Hurson}, title = {Multiprogramming and concurrency in parallel file environments}, journal = {International Journal of Mini and Microcomputers}, year = {1991}, volume = {13}, number = {2}, pages = {37--45}, keywords = {parallel file system, parallel I/O, database, pario-bib}, comment = {This is really for databases. They identify two types of file access: one where the file can be operated on as a set of subfiles, each independently by a processor (what they call MIMD mode), and another where the file must be operated on with a centralized control (SIMD mode), in their case to search a B-tree whose nodes span the set of processors. Basically it is a host connected to a controller, that is connected to a set of small I/O processors, each of which has access to disk. In many ways a uniprocessor perspective. Paper design, with simulation results.} } @Article{miller:pass, author = {L.~L. Miller and S.~R. Inglett and A.~R. Hurson}, title = {{PASS}--- A Multiuser Parallel File System Based on Microcomputers}, journal = {Journal of systems and software}, year = {1992}, month = {September}, volume = {19}, number = {1}, pages = {75--83}, keywords = {parallel I/O, parallel file system, multiprocessor file system, pario-bib}, abstract = {Data intensive computer applications suffer from inadequate use of parallelism for processing data stored on secondary storage devices. Devices such as database machines are useful in some applications, but many applications are too small or specialized to use such technology. To bridge this gap, the authors introduce the parallel secondary storage (PASS) system. PASS is based on a network of microcomputers. The individual microcomputers are assigned to a unit of secondary storage and the operations of the microcomputers are initiated and monitored by a control processor. The file system is capable of acting as either an SIMD or an MIMD machine. Communication between the individual microcomputers and the control processor is described. The integration of the multiple microcomputers into the primitive operations on a file is examined. Finally, the strategies employed to enhance performance in the multiprogramming environment are discussed.} } @Article{miller:pfs, author = {L. L. Miller and S. R. Inglett}, title = {Enhancing performance in a parallel file system}, journal = {Microprocessing and Microprogramming}, year = {1994}, month = {May}, volume = {40}, number = {4}, pages = {261--274}, keywords = {parallel I/O, parallel file system, pario-bib} } @InProceedings{miller:radar, author = {Craig Miller and David G. Payne and Thanh N. 
Phung and Herb Siegel and Roy Williams}, title = {Parallel Processing of Spaceborne Imaging Radar Data}, booktitle = {Proceedings of Supercomputing '95}, year = {1995}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://www.supercomp.org/sc95/proceedings/012_PAYN/SC95.HTM}, keywords = {parallel I/O, pario-bib}, abstract = {We discuss the results of a collaborative project on parallel processing of Synthetic Aperture Radar (SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the California Institute of Technology (Caltech) and Intel Scalable Systems Division (SSD). Through this collaborative effort, we have successfully parallelized the most compute-intensive SAR correlator phase of the Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the Intel Paragon. We describe the data decomposition, the scalable high-performance I/O model, and the node-level optimizations which enable us to obtain efficient processing throughput. In particular, we point out an interesting double level of parallelization arising in the data decomposition which increases substantially our ability to support ``high volume'' SAR. Results are presented from this code running in parallel on the Intel Paragon. A representative set of SAR data, of size 800 Megabytes, which was collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15 seconds, is processed in 55 seconds on the Concurrent Supercomputing Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes for the current SIR-C/X-SAR processing system at JPL. For the first time, a commercial system can process SIR-C/X-SAR data at a rate which is approaching the rate at which the SIR-C/X-SAR instrument can collect the data. This work has successfully demonstrated the viability of the Intel Paragon supercomputer for processing ``high volume'' Synthetic Aperture Radar data in near real-time.}, comment = {Available only on CD-ROM and WWW.} } @InProceedings{miller:rama, author = {Ethan L. Miller and Randy H. Katz}, title = {{RAMA:} A File System for Massively-Parallel Computers}, booktitle = {Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems}, year = {1993}, pages = {163--168}, later = {miller:rama2}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {The multiprocessor's file system acts as a block cache for tertiary storage. Disk space is broken into ``lines'' of a few MB. Each line has a descriptor telling what blocks it has, and their status. (fileid, offset) hashed to find (disk, linenum). Intrinsic metadata stored at start of each file; positional metadata implicit in hashing, and line descriptors. Sequentiality parameter puts several blocks of a file in the same line, to improve medium-sized requests (otherwise generate lots of request-response net traffic). Not clear on best choice of size. No mention of atomicity wrt concurrent writes to same data. Blocks migrate to tertiary storage as they get old. Fetched on demand, by block (not file). Self-describing blocks have ids in block -- leads to screwy block sizes?} } @InProceedings{miller:rama2, author = {Ethan L. Miller and Randy H. Katz}, title = {{RAMA}: Easy Access to a High-Bandwidth Massively Parallel File System}, booktitle = {Proceedings of the 1995 USENIX Technical Conference}, year = {1995}, month = {January}, pages = {59--70}, earlier = {miller:rama}, later = {miller:jrama}, keywords = {parallel file system, pario-bib}, comment = {Simulation results. 
RAMA distributes blocks of each file randomly across disks, which are attached to all processor nodes, using a hash function. Thus there is no centralized metadata. The big benefit is uniform performance regardless of access pattern; they found one situation where it was 10\% slower than an optimal striped layout, but many cases where they were as much as 4 times faster than bad striped data layouts. So, they can give reasonable performance without the need for programmer- or manager-specified data layouts.} } @Article{milligan:bifs, author = {P. Milligan and L. C. Waring and A. S. C. Lee}, title = {{BIFS}: {A} filing system for multiprocessor based systems}, journal = {Microprocessing and Microprogramming}, year = {1991}, volume = {31}, pages = {9--12}, note = {Euromicro~'90 conference, Amsterdam}, keywords = {multiprocessor file system, pario-bib}, comment = {A simple file system for a transputer network, attached to a single disk device. Several procs are devoted to the file system, but really just act as buffers for the host processor that runs the disk. They provide sequential, random access, and indexed files, either byte- or record-oriented. Some prototypes; no results. They add buffering and double buffering, but don't really get into anything interesting.} } @Article{miya:biblio, author = {Eugene N. Miya}, title = {Multiprocessor/Distributed Processing Bibliography}, journal = {Computer Architecture News}, year = {1985}, month = {March}, volume = {13}, number = {1}, pages = {27--29}, note = {Much updated since then, now kept on-line}, keywords = {bibliography, parallel computing, distributed computing, pario-bib}, comment = {This reference is the original publication of Eugene's annotated bibliography. It has grown tremendously and is now huge. Because of the copyright considerations, you can't just nab it off the net, but it is free for the asking from Eugene. Send mail to eugene@nas.nasa.gov.} } @Article{miyamura:adventure-io, author = {Tomoshi Miyamura and Shinobu Yoshimura}, title = {Generalized {I/O} data format and interface library for module-based parallel finite element analysis system.}, journal = {Advances in Engineering Software}, year = {2004}, month = {March}, volume = {35}, number = {3--4}, pages = {149--159}, institution = {Nihon Univ, Coll Engn, Dept Comp Sci, 1 Nakagawara, Koriyama, Fukushima 9638642, Japan; Nihon Univ, Coll Engn, Dept Comp Sci, Koriyama, Fukushima 9638642, Japan; Univ Tokyo, Grad Sch Frontier Sci, Inst Environm Studies, Bunkyo Ku, Tokyo 1138656, Japan}, publisher = {UK : Elsevier, 2004}, copyright = {(c)2005 Elsevier Engineering Information, Inc.; IEE; The Thomson Corporation}, keywords = {data format, finite element method, generalized I/O data, hierarchical domain decomposition, pario-app, pario-bib}, abstract = {In this paper, a generalized input/output (I/O) data format and library for a module-based parallel finite element analysis system are proposed. The module-based system consists of pre-, main- and post-modules, as well as some common libraries. The present I/O library, called ADVENTURE_IO, and data format are developed specifically for use in parallel high-performance computational mechanics system. These are rather simple compared to other general-purpose I/O systems such as netCDF and HDF5. A simple container called a finite element generic attributes (FEGAs) document enables the handling of almost all the I/O data in a parallel finite element method code. 
Due to the simplicity of the present system, tuning up the I/O library for a specific parallel environment is easy. Other major features of the present system are: (1) it possesses a generalized collaboration mechanism consisting of multiple modules in a distributed computing environment employing common object request broker architecture, and (2) abstracted data description employed in the FEGA/HDDM_FEGA document enables the development of a unique domain decomposer that can subdivide any kind of input data.} } @InProceedings{mogi:parity, author = {Kazuhiko Mogi and Masaru Kitsuregawa}, title = {Dynamic Parity Stripe Reorganizations for {RAID5} Disk Arrays}, booktitle = {Proceedings of the Third International Conference on Parallel and Distributed Information Systems}, year = {1994}, month = {September}, pages = {17--26}, keywords = {disk array, RAID, disk striping, parallel I/O, pario-bib}, abstract = {RAID5 disk arrays provide high performance and high reliability for reasonable cost. However RAID5 suffers a performance penalty during block updates. We examine the feasibility of using "dynamic parity striping" to improve the performance of block updates. Instead of updating each block independently, this method buffers a number of updates, generates a new stripe composed of the newly updated blocks, then writes the full stripe back to disk. Two implementations are considered in this paper. One is a log-structured file system (LFS) based method and the other is Virtual Striping. Both methods achieve much higher performance than conventional approaches. The performance characteristics of the LFS based method and the Virtual Striping method are clarified.} } @Article{mokhoff:pario, author = {Nicholas Mokhoff}, title = {Parallel Disk Assembly Packs 1.5 {GBytes}, runs at 4 {MBytes/s}}, journal = {Electronic Design}, year = {1987}, month = {November}, pages = {45--46}, keywords = {parallel I/O, I/O, disk architecture, disk striping, reliability, pario-bib}, comment = {Commercially available: Micropolis Systems' Parallel Disk 1800 series. Four disks plus one parity disk, synchronized and byte-interleaved. SCSI interface. Total capacity 1.5 GBytes, sustained transfer rate of 4 MBytes/s. MTTF 140,000 hours. Hard and soft errors corrected in real-time. Failed drives can be replaced while system is running.} } @InCollection{molero:modeling, author = {Xavier Molero and Federico Silla and Vicente Santonja and Jos\'e Duato}, title = {Modeling and Evaluation of {Fibre Channel} Storage Area Networks}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {31}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {464--473}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, URL = {http://www.buyya.com/superstorage/}, keywords = {storage area network, pario-bib}, comment = {Part of jin:io-book.} } @TechReport{montague:swift, author = {Bruce R. Montague}, title = {The {Swift/RAID} Distributed Transaction Driver}, year = {1993}, month = {January}, number = {UCSC-CRL-93-99}, institution = {UC Santa Cruz}, keywords = {RAID, parallel I/O, distributed file system, transaction, pario-bib}, comment = {See other Swift papers, e.g., cabrera:pario and long:swift-raid. This paper describes the basic idea of using a transaction driver to implement RAID over a distributed system. Then it spends most of the time describing the details of the implementation.
The basic idea is that processors execute transaction drivers, which provide virtual CPUs to execute scripts of atomic 'instructions', where the instructions are high-level things like read block, write block, compute parity, etc. The transaction driver multiprocesses several scripts if necessary. (Although they describe it in the context of a RAID implementation it certainly could be used for other complex distributed services.) The instructions are often transaction pairs, which compile into a pair of instructions, one for this node and one for the remote node. This node sends the program to the remote node, and they execute them separately, keeping synchronized for transaction pairs when necessary. See also the newer paper in Computing Surveys, long:swift-raid.} } @Article{moon:declustering, author = {Bongki Moon and Joel H. Saltz}, title = {Scalability Analysis of Declustering Methods for Multidimensional Range Queries}, journal = {IEEE Transactions on Knowledge and Data Engineering}, year = {1997}, note = {To appear}, URL = {ftp://hpsl.cs.umd.edu/pub/papers/ieee_tkde.ps.Z}, keywords = {parallel I/O, parallel database, declustering, pario-bib}, abstract = {Efficient storage and retrieval of multi-attribute datasets have become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses. Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods Disk Modulo and Fieldwise Xor for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods and are corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.} } @Article{moore:ddio, author = {Jason A. Moore and Michael J. Quinn}, title = {Enhancing Disk-Directed {I/O} for Fine-Grained Redistribution of File Data}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {477--499}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel I/O, multiprocessor file system, interprocessor communication, pario-bib}, comment = {They propose several enhancements to disk-directed I/O (see kotz:diskdir) that aim to improve performance on fine-grained distributions, that is, where each block from the disk is broken into small pieces that are scattered among the compute processors. One enhancement combines multiple pieces, possibly from separate disk blocks, into a single message. Another is to use two-phase I/O (see delrosario:two-phase), but to use disk-directed I/O to read data from the disks into CP memories, efficiently, then permute. This latter technique is probably faster than normal two-phase I/O that uses a traditional file system, not disk-directed I/O, for the read.} } @InProceedings{moore:detection, author = {Jason A. Moore and Philip J. Hatcher and Michael J.
Quinn}, title = {Efficient Data-Parallel Files via Automatic Mode Detection}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {1--14}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, data parallelism, pario-bib}, abstract = {Parallel languages rarely specify parallel I/O constructs, and existing commercial systems provide the programmer with a low-level I/O interface. We present design principles for integrating I/O into languages and show how these principles are applied to a virtual-processor-oriented language. We illustrate how machine-independent modes are used to support both high performance and generality. We describe an automatic mode detection technique that saves the programmer from extra syntax and low-level file system details. We show how virtual processor file operations, typically small by themselves, are combined into efficient large-scale file system calls. Finally, we present a variety of benchmark results detailing design tradeoffs and the performance of various modes.}, comment = {Updated version of TR 95-80-9. See moore:stream. Interesting approach, where they permit a fairly normal fread and fwrite kind of interface, with each VP having its own stream. They choose their own format for the file, and switch between formats (and internal buffering) depending on the particulars of the fread and fwrite parameters. They seem to have good performance, and a familiar interface. They are left with a non-standard file format.} } @TechReport{moore:ocean, author = {Jason A. Moore}, title = {Parallel {I/O} Requirements of Four Oceanography Applications}, year = {1995}, month = {January}, number = {95-80-1}, institution = {Oregon State University}, keywords = {data parallel, file system workload, parallel I/O, pario-bib}, abstract = {Brief descriptions of the I/O requirements for four production oceanography programs running at Oregon State University are presented. The applications all rely exclusively on array-oriented, sequential file operations. Persistent files are used for checkpointing and movie making, while temporary files are used to store out-of-core data.}, comment = {See moore:detection, moore:stream. Only three pages.} } @InProceedings{moore:stream, author = {Jason A. Moore and Philip J. Hatcher and Michael J. Quinn}, title = {Stream*: Fast, Flexible, Data-parallel {I/O}}, booktitle = {Parallel Computing: State-of-the-Art and Perspectives (ParCo~'95)}, year = {1995}, month = {September}, pages = {287--294}, publisher = {Elsevier Science}, earlier = {moore:stream-tr}, keywords = {data parallel, parallel I/O, pario-bib} } @TechReport{moore:stream-tr, author = {Jason A. Moore and Philip J. Hatcher and Michael J. Quinn}, title = {Stream*: Fast, Flexible, Data-parallel {I/O}}, year = {1994}, number = {94-80-13}, institution = {Oregon State University}, note = {Updated September 1995.}, later = {moore:stream}, keywords = {data parallel, parallel I/O, pario-bib}, abstract = {Although hardware supporting parallel file I/O has improved greatly since the introduction of first-generation parallel computers, the programming interface has not. Each vendor provides a different logical view of parallel files as well as nonportable operations for manipulating files. Neither do parallel languages provide standards for performing I/O. 
In this paper, we describe a view of parallel files for data-parallel languages, dubbed Stream*, in which each virtual processor writes to and reads from its own stream. In this scheme each virtual processor's I/O operations have the same familiar, unambiguous meaning as in a sequential C program. We demonstrate how I/O operations in Stream* can run as fast as those of vendor-specific parallel file systems on the operations most often encountered in data-parallel programs. We show how this system supports general virtual processor operations for debugging and elemental functions. Finally, we present empirical results from a prototype Stream* system running on a Meiko CS-2 multicomputer.}, comment = {See moore:stream; nearly identical. See also moore:detection. This paper gives a little bit earlier description of the Stream* idea than does moore:detection, but you'd be pretty much complete just reading moore:detection.} } @InProceedings{moran:imad, author = {David Moran and Gary Ditlow and Daria Dooling and Ralph Williams and Tom Wilkins}, title = {Integrated Manufacturing and Design}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/moran.pdf}, keywords = {manufacturing, integrated chip, parallel I/O, pario-bib}, comment = {They describe "IMaD", a parallel code used to support product engineering of full-scale integrated circuits. The code itself simulates the entire integrated circuit to address three primary aspects of product engineering: to assure that an IC is manufacturable, to monitor its lifetime yield and reliability, and to support IC test and failure analysis. The simulation is computationally, memory and I/O intensive. While the paper primarily describes the model and the simulation equations, the talk addressed the issue of parallel I/O, where the data for each processor was written to a separate disk. Not exactly a novel approach, but it emphasizes the fact that the I/O requirements are large enough that they used an approach other than a standard serial method.} } @InProceedings{more:mtio, author = {Sachin More and Alok Choudhary and Ian Foster and Ming Q. Xu}, title = {{MTIO:} A Multi-Threaded Parallel {I/O} System}, booktitle = {Proceedings of the Eleventh International Parallel Processing Symposium}, year = {1997}, month = {April}, pages = {368--373}, URL = {http://www.ece.nwu.edu/~ssmore/ipps97.ps}, keywords = {threads, parallel I/O, pario-bib}, abstract = {This paper presents the design and evaluation of a multi-threaded runtime library for parallel I/O. We extend the multi-threading concept to separate the compute and I/O tasks in two separate threads of control. Multi-threading in our design permits a) asynchronous I/O even if the underlying file system does not support asynchronous I/O; b) copy avoidance from the I/O thread to the compute thread by sharing address space; and c) a capability to perform collective I/O asynchronously without blocking the compute threads. Further, this paper presents techniques for collective I/O which maximize load balance and concurrency while reducing communication overhead in an integrated fashion. Performance results on IBM SP2 for various data distributions and access patterns are presented.
The results show that there is a tradeoff between the amount of concurrency in I/O and the buffer size designated for I/O; and there is an optimal buffer size beyond which benefits of larger requests diminish due to large communication overheads.} } @Article{moren:controllers, author = {William D. Moren}, title = {Design of Controllers is Key Element in Disk Subsystem Throughput}, journal = {Computer Technology Review}, year = {1988}, month = {Spring}, pages = {71--73}, keywords = {parallel I/O, disk architecture, pario-bib}, comment = {A short paper on some basic techniques used by disk controllers to improve throughput: seek optimization, request combining, request queuing, using multiple drives in parallel, scatter/gather DMA, data caching, read-ahead, cross-track read-ahead, write-back caching, segmented caching, reduced latency (track buffering), and format skewing. [Most of these are already handled in Unix file systems.]} } @InProceedings{mourad:raid, author = {Antoine N. Mourad and W. Kent Fuchs and Daniel G. Saab}, title = {Performance of Redundant Disk Array Organizations in Transaction Processing Environments}, booktitle = {Proceedings of the 1993 International Conference on Parallel Processing}, year = {1993}, pages = {I--138--145}, publisher = {CRC Press}, address = {St. Charles, IL}, keywords = {parallel I/O, disk array, pario-bib, RAID}, comment = {Transaction-processing workload dominated by small I/Os. They compare RAID~5, Parity Striping (which was designed for TP because it avoids lots of seeks on medium-sized requests, by declustering parity but not data), mirroring, and RAID~0. RAID~5 does {\em better\/} than parity striping due to its load balancing ability on the skewed workload. RAID~5 also better as the load increases.} } @InProceedings{mowry:prefetch, author = {Todd C. Mowry and Angela K. Demke and Orran Krieger}, title = {Automatic compiler-inserted {I/O} prefetching for out-of-core applications}, booktitle = {Proceedings of the 1996 Symposium on Operating Systems Design and Implementation}, year = {1996}, month = {October}, pages = {3--17}, publisher = {USENIX Association}, later = {mowry:jprefetch}, URL = {http://www.usenix.org/publications/library/proceedings/osdi96/mowry.html}, keywords = {compiler, prefetch, parallel I/O, pario-bib}, abstract = {Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve ``out-of-core'' problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and release hints for managing I/O, and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively move prefetches back ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. 
We have implemented our scheme using the SUIF compiler and the Hurricane operating system. Our experimental results demonstrate that our fully-automatic scheme effectively hides the I/O latency in out-of-core versions of the entire NAS Parallel benchmark suite, thus resulting in speedups of roughly twofold for five of the eight applications, with two applications speeding up by threefold or more.}, comment = {Best Paper Award.} } @Article{moyer:application, author = {S. Moyer and V. S. Sunderam}, title = {Parallel {I/O} as a Parallel Application}, journal = {International Journal of Supercomputer Applications}, year = {1995}, month = {Summer}, volume = {9}, number = {2}, pages = {95--107}, keywords = {parallel I/O, pario-bib}, comment = {An overview of PIOUS and its performance. Results for partitioned and self-scheduled access pattern. See other moyer:* papers. The big thing about PIOUS over previous parallel file systems is its internal use of transactions for concurrency control and user-selectable fault-tolerance guarantees, and its optional support of user-level transactions.} } @TechReport{moyer:characterize, author = {Steven A. Moyer and V.~S. Sunderam}, title = {Characterizing Concurrency Control Performance for the {PIOUS} Parallel File System}, year = {1995}, month = {June}, number = {CSTR-950601}, institution = {Emory University}, later = {moyer:jcharacterize}, URL = {ftp://ftp.mathcs.emory.edu/pub/cstr/CSTR950601.ps}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {Parallel file systems employ data declustering to increase I/O throughput. But because a single read or write operation can generate data accesses on multiple independent storage devices, a concurrency control mechanism must be employed to retain familiar file access semantics. Concurrency control negates some of the performance benefits of data declustering by introducing additional file access overhead. This paper examines the performance characteristics of the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Results demonstrate that linearizability of file access operations is provided without loss of scalability or stability.}, comment = {``substantially different material than presented in a previous report,'' moyer:scalable-tr. But it seems like the moyer:scalable IOPADS paper is largely a subset of this TR. He describes how they use volatile transactions, and does some experiments with PIOUS to measure their efficiency. Basically, they use a 2-phase commit protocol, using timeouts to detect deadlock and transaction aborts to remedy the deadlock. Results for partitioned and sequential access patterns.} } @Article{moyer:jcharacterize, author = {Steven A. Moyer and V.S. Sunderam}, title = {Characterizing Concurrency Control Performance for the {PIOUS} Parallel File System}, journal = {Journal of Parallel and Distributed Computing}, year = {1996}, month = {October}, volume = {38}, number = {1}, pages = {81--91}, earlier = {moyer:characterize}, keywords = {parallel I/O, multiprocessor file system, pario-bib} } @InProceedings{moyer:pario, author = {Steven A. Moyer and V. S. 
Sunderam}, title = {A Parallel {I/O} System for High-Performance Distributed Computing}, booktitle = {Proceedings of the IFIP WG10.3 Working Conference on Programming Environments for Massively Parallel Distributed Systems}, year = {1994}, URL = {ftp://ftp.mathcs.emory.edu/pub/vss/piousifip94.ps}, keywords = {parallel I/O, parallel file system, workstation cluster, file system interface, pario-bib}, comment = {See moyer:pious. A further description of the PIOUS parallel file system for cluster computing. (Beta-test version available via ftp). They support parafiles, which are collections of segments, each segment residing on a different server. The segments can be viewed separately or can be interleaved into a linear sequence using an arbitrary chunk size. They also support transactions to provide sequential consistency.} } @InProceedings{moyer:pious, author = {Steven A. Moyer and V. S. Sunderam}, title = {{PIOUS:} A Scalable Parallel {I/O} System for Distributed Computing Environments}, booktitle = {Proceedings of the Scalable High-Performance Computing Conference}, year = {1994}, pages = {71--78}, URL = {ftp://ftp.mathcs.emory.edu/pub/vss/piousshpcc94.ps.Z}, keywords = {parallel I/O, parallel file system, workstation cluster, file system interface, pario-bib}, comment = {Basically, I/O for clusters of workstations; ideally, it is parallel, heterogeneous, fault tolerant, etc. File servers are independent, have only a local view. Single server used to coordinate open(). Client libraries implement the API and depend on the servers only for the storage mechanism. Servers use transactions internally -- but usually these are lightweight transactions, only used for concurrency control and not recovery. Full transactions are supported for times when the user wants the extra fault tolerance. They have files that are in some sense 2-dimensional. Sequential consistency. User-controllable fault tolerance. Performance: 2 clients max out the transport (Ethernet). ``Stable'' mode is slow, as is self-scheduled mode. No client caching. See moyer:pario.} } @InProceedings{moyer:scalable, author = {Steven A. Moyer and V. S. Sunderam}, title = {Scalable Concurrency Control for Parallel File Systems}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {90--106}, earlier = {moyer:scalable-tr}, later = {moyer:scalable-book}, keywords = {parallel I/O, pario-bib}, abstract = {Parallel file systems employ data declustering to increase I/O throughput. As a result, a single read or write operation can generate concurrent data accesses on multiple storage devices. Unless a concurrency control mechanism is employed, familiar file access semantics are likely to be violated. This paper details the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Performance results are presented demonstrating that sequential consistency semantics can be provided without loss of system scalability.}, comment = {Seems to be a subset of moyer:scalable-tr, and for that matter, moyer:characterize. Results for partitioned access pattern.} } @InCollection{moyer:scalable-book, author = {Steven A. Moyer and V.~S. Sunderam}, title = {Scalable Concurrency Control for Parallel File Systems}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {10}, editor = {Ravi Jain and John Werth and James C.
Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {225--243}, publisher = {Kluwer Academic Publishers}, earlier = {moyer:scalable}, keywords = {parallel I/O, parallel file system, concurrency control, synchronization, transaction, pario-bib}, abstract = {Parallel file systems employ data declustering to increase \mbox{I/O} throughput. As a result, a single read or write operation can generate concurrent data accesses on multiple storage devices. Unless a concurrency control mechanism is employed, familiar file access semantics are likely to be violated. This paper details the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Performance results are presented demonstrating that sequential consistency semantics can be provided without loss of system scalability.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @TechReport{moyer:scalable-tr, author = {Steven A. Moyer and V.~S. Sunderam}, title = {Scalable Concurrency Control for Parallel File Systems}, year = {1995}, month = {February}, number = {CSTR-950202}, institution = {Emory University}, later = {moyer:scalable}, URL = {ftp://ftp.mathcs.emory.edu/pub/cstr/CSTR950202.ps}, keywords = {parallel I/O, parallel file system, pario-bib}, abstract = {Parallel file systems employ data declustering to increase I/O throughput. As a result, a single read or write operation can generate concurrent data accesses on multiple storage devices. Unless a concurrency control mechanism is employed, familiar file access semantics are likely to be violated. This paper details the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Performance results are presented demonstrating that sequential consistency semantics can be provided without loss of system scalability.}, comment = {They describe {\em volatile transactions\/} as a way of providing the appropriate sequential consistency among file-read and -write operations (a feature not provided by most file systems). Their PIOUS library implements these transactions with strict 2-phase locking. They show some performance results, though only on a limited and relatively simple benchmark. If nothing else this paper reminds us all that atomicity of file-read and -write requests should be available to the user (e.g., note how they are optional in Vesta). Published as moyer:scalable.} } @Misc{mpi-forum:mpi2, key = {MPI}, title = {{MPI-2}: Extensions to the Message-Passing Interface}, year = {1997}, month = {July}, howpublished = {{The MPI Forum}}, earlier = {mpi-ioc:mpi-io5}, URL = {http://www.mpi-forum.org/docs/docs.html}, keywords = {parallel I/O, message-passing, multiprocessor file system interface, pario-bib}, comment = {This is the definition of the MPI2 message-passing standard, which includes an interface for parallel I/O. Supersedes mpi-ioc:mpi-io5 and earlier versions. See the MPI2 web page at http://www.mpi-forum.org. The I/O section is at http://www.mpi-forum.org/docs/mpi-20-html/node172.html.} } @Misc{mpi-ioc:mpi-io5, key = {MPIO}, title = {{MPI-IO:} A Parallel File {I/O} Interface for {MPI}}, year = {1996}, month = {April}, howpublished = {{The MPI-IO Committee}}, note = {Version 0.5.}, earlier = {corbett:mpi-io4}, later = {mpi-forum:mpi2}, keywords = {parallel I/O, message-passing, multiprocessor file system interface, pario-bib}, comment = {Supersedes corbett:mpi-io4 and earlier versions.
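[Editorial aside, not from these entries: a minimal sketch of the collective file-access style this interface family standardizes, written against the final MPI-2 names; the file name and block size are arbitrary choices for illustration.]

#include <mpi.h>

#define N 1024   /* integers per process; arbitrary */

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank, i, buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        buf[i] = rank;                      /* this rank's block of data */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* every rank writes its block at a rank-determined offset in one
       collective call, so the library can coalesce the requests */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * N * sizeof(int),
                          buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}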
See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.} } @InBook{mpi2-io, author = {{Message-Passing Interface Forum}}, title = {{MPI-2.0}: Extensions to the Message-Passing Interface}, chapter = {9}, year = {1997}, month = {June}, publisher = {MPI Forum}, URL = {http://www.mpi-forum.org/docs/docs.html}, keywords = {MPI, message passing, parallel computing, library, parallel I/O, pario-bib}, comment = {Chapter 9 is about I/O extensions.} } @InProceedings{mueck:multikey, author = {T.~A. Mueck and J. Witzmann}, title = {Multikey Index Support for Tuple Sets on Parallel Mass Storage Systems}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {136--145}, URL = {http://www.computer.org/conferen/mss95/mueck/mueck.htm}, keywords = {parallel database, mass storage, parallel I/O, pario-bib}, abstract = {The development and evaluation of a tuple set manager (TSM) based on multikey index data structures is a main part of the PARABASE project at the University of Vienna. The TSM provides access to parallel mass storage systems using tuple sets instead of conventional files as the central data structure for application programs. A proof-of-concept prototype TSM is already implemented and operational on an iPSC/2. It supports tuple insert and delete operations as well as exact match, partial match, and range queries at system call level. Available results are from this prototype on the one hand and from various performance evaluation figures. The evaluation results demonstrate the performance gain achieved by the implementation of the tuple set management concept on a parallel mass storage system.} } @InProceedings{muller:multi, author = {Keith Muller and Joseph Pasquale}, title = {A High Performance Multi-Structured File System Design}, booktitle = {Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles}, year = {1991}, pages = {56--67}, publisher = {ACM Press}, address = {Pacific Grove, CA}, keywords = {file system, disk striping, disk mirroring, pario-bib} } @InProceedings{muntz:failure, author = {Richard R. Muntz and John C. S. Lui}, title = {Performance Analysis of Disk Arrays Under Failure}, booktitle = {Proceedings of the 16th International Conference on Very Large Data Bases}, year = {1990}, pages = {162--173}, keywords = {disk array, parallel, performance analysis, pario-bib}, comment = {Looked at RAID5 when in failure mode. For small-reads workload, could only get 50\% of normal. So they decouple cluster size and parity-group size, so that they decluster over more disks than group size; during failure, this causes less of a load increase on surviving disks.} } @Article{muntz:intro, author = {Richard R. Muntz and Leana Golubchik}, title = {Parallel Data Servers and Applications}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {1--4}, keywords = {parallel I/O, multimedia, databases, pario-bib}, comment = {Introduction to a special issue.} } @InProceedings{mutisya:cache, author = {Gerald Mutisya and Bradley M. Broom}, title = {Distributed File Caching for the {AP1000}}, booktitle = {Proceedings of the Third Fujitsu-ANU CAP Workshop}, year = {1992}, month = {November}, keywords = {distributed file system, multiprocessor file system, pario-bib}, comment = {See also broom:acacia, broom:impl, lautenbach:pfs, and broom:cap. They examine ways to manage a distributed file cache, without replication. 
Since there is no replication, the concurrency control problems boil down to providing atomicity for multi-block, multi-site requests. This is handled essentially by serializing the request: send the request to the first site, and have it forward the request from site to site as each block is processed. This works fine but completely serializes all multi-block requests, somewhat defeating the purpose. Thus, they get concurrency between requests, by having multiple servers, but no parallelism within requests.} } @Article{myllymaki:buffering, author = {Jussi Myllymaki and Miron Livny}, title = {Efficient buffering for concurrent disk and tape {I/O}}, journal = {Performance Evaluation: An International Journal}, year = {1996}, volume = {27/28}, pages = {453--471}, note = {Performance~'96}, keywords = {buffering, file caching, tertiary storage, tape robot, file migration, parallel I/O, pario-bib}, comment = {Ways to use secondary and tertiary storage in parallel, and buffering mechanisms for applications with concurrent I/O requirements.} } @InProceedings{nagaraj:hpfs, author = {U. Nagaraj and U. S. Shukla and A. Paulraj}, title = {Design and Evaluation of a High Performance File System for Message Passing Parallel Computers}, booktitle = {Proceedings of the Fifth International Parallel Processing Symposium}, year = {1991}, pages = {549--554}, keywords = {multiprocessor file system, pario-bib}, comment = {They describe a file system for general message-passing, distributed-memory, separate I/O and compute node, multicomputers. They provide few details, although they cite a lot of their tech reports. There are a few simulation results, but none show anything unintuitive.} } @InProceedings{nagashima:pario, author = {Umpei Nagashima and Takashi Shibata and Hiroshi Itoh and Minoru Gotoh}, title = {An Improvement of {I/O} Function for Auxiliary Storage: {Parallel I/O} for a Large Scale Supercomputing}, booktitle = {Proceedings of the 1990 ACM International Conference on Supercomputing}, year = {1990}, pages = {48--59}, keywords = {parallel I/O, pario-bib}, comment = {Using parallel I/O channels to access striped disks, in parallel from a supercomputer. They {\em chain}\/ (i.e., combine) requests to a disk for large contiguous accesses.} } @InProceedings{nakajo:ionet, author = {H. Nakajo and S. Ohtani and T. Matsumoto and M. Kohata and K. Hiraki and Y. Kaneda}, title = {An {I/O} Network for Architecture of the Distributed Shared-Memory Massively Parallel Computer {JUMP-1}}, booktitle = {Proceedings of the 11th ACM International Conference on Supercomputing}, year = {1997}, month = {July}, pages = {253--260}, publisher = {ACM Press}, keywords = {collective I/O, multiprocessor file system, parallel I/O, pario-bib} } @InProceedings{nakajo:jump1, author = {Hironori Nakajo}, title = {A Simulation-based Evaluation of a Disk {I/O} Subsystem for a Massively Parallel Computer: {JUMP-1}}, booktitle = {Proceedings of the Sixteenth International Conference on Distributed Computer Systems}, year = {1996}, month = {May}, pages = {562--569}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, I/O architecture, pario-bib}, abstract = {JUMP-1 is a distributed shared-memory massively parallel computer and is composed of multiple clusters interconnected by a network called RDT (Recursive Diagonal Torus). Each cluster in JUMP-1 consists of 4 element processors, secondary cache memories, and 2 MBP (Memory Based Processor) for high-speed synchronization and communication among clusters.
The I/O subsystem is connected to a cluster via a high-speed serial link called STAFF-Link. The I/O buffer memory is mapped onto the JUMP-1 global shared-memory to permit each I/O access operation as memory access. In this paper we describe evaluation of the fundamental performance of the disk I/O subsystem using event-driven simulation, and estimated performance with a Video On Demand (VOD) application.} } @Article{nastea:optimization, author = {S. Nastea and V. Sgarciu and M. Simonca}, title = {Parallel {I/O} performance optimization}, journal = {Revue Roumaine des Sciences Techniques Serie Electrotechnique et Energetique}, year = {2000}, volume = {45}, number = {3}, pages = {487--500}, publisher = {Acad\'emie Roumaine, Editura Academiei Rom\^ane}, keywords = {parallel I/O, pario-bib} } @InProceedings{natarajan:clusterio, author = {Chita Natarajan and Ravishankar K. Iyer}, title = {Measurement and Simulation Based Performance Analysis of Parallel {I/O} in a High-Performance Cluster System}, booktitle = {Proceedings of the 1996 IEEE Symposium on Parallel and Distributed Processing}, year = {1996}, month = {October}, pages = {332--339}, publisher = {IEEE Computer Society Press}, keywords = {performance analysis, parallel I/O, pario-bib}, abstract = {This paper presents a measurement and simulation based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The measurements were used to characterize the performance bottlenecks and the throughput limits at the compute and I/O nodes, and to provide realistic input parameters to PioSim, a simulation environment we have developed to investigate parallel I/O performance issues in cluster systems. PioSim was used to obtain a detailed characterization of parallel I/O performance, in the high performance cluster system, for different regular access patterns and different system configurations. This paper also explores the use of local disks at the compute nodes for parallel I/O, and finds that the local disk architecture outperforms the traditional parallel I/O over remote I/O node disks architecture, even when as much as 68-75\% of the requests from each compute node goes to remote disks.} } @TechReport{ncr:3600, key = {NCR}, title = {{NCR 3600} Product Description}, year = {1991}, month = {September}, number = {ST-2119-91}, institution = {NCR}, address = {San Diego}, keywords = {multiprocessor architecture, MIMD, parallel I/O, pario-bib}, comment = {Has 1-32 50MHz Intel 486 processors. Parallel independent disks on the disk nodes, separate from the processor nodes. Tree interconnect. Aimed at database applications.} } @InProceedings{ng:diskarray, author = {Spencer Ng}, title = {Some Design Issues of Disk Arrays}, booktitle = {Proceedings of IEEE Compcon}, year = {1989}, month = {Spring}, pages = {137--142}, note = {San Francisco, CA}, keywords = {parallel I/O, disk array, pario-bib}, comment = {Discusses disk arrays and striping. Transfer size is important to striping success: small size transfers are better off with independent disks. Synchronized rotation is especially important for small transfer sizes, since then the increased rotational delays dominate. Fine grain striping involves less assembly/disassembly delay, but coarse grain (block) striping allows for request parallelism. Fine grain striping wastes capacity due to fixed size formatting overhead. 
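[Editorial illustration, not from the paper: with coarse-grain (block) striping the placement of a logical block is simple modular arithmetic; a minimal sketch, with the number of disks chosen arbitrarily.]

#include <stdio.h>

/* Block-interleaved striping: logical block b lives on disk b mod ndisks,
   at local block b / ndisks.  Fine-grain striping would instead split every
   block across all of the disks. */
struct location { int disk; long block; };

static struct location map_block(long logical_block, int ndisks)
{
    struct location loc;
    loc.disk = (int)(logical_block % ndisks);
    loc.block = logical_block / ndisks;
    return loc;
}

int main(void)
{
    struct location loc = map_block(10, 4);   /* disk 2, local block 2 */
    printf("disk %d, local block %ld\n", loc.disk, loc.block);
    return 0;
}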
He also derives an exact MTTF equation for 1-failure tolerance and on-line repair.} } @InProceedings{ng:interleave, author = {S. Ng and D. Lang and R. Selinger}, title = {Trade-offs Between Devices and Paths in Achieving Disk Interleaving}, booktitle = {Proceedings of the 15th Annual International Symposium on Computer Architecture}, year = {1988}, pages = {196--201}, keywords = {parallel I/O, disk architecture, disk caching, I/O bottleneck, pario-bib}, comment = {Compares four different ways of restructuring IBM disk controllers and channels to obtain more parallelism. They use parallel heads or parallel actuators. The best results come when they replicate the control electronics to maintain the number of data paths through the controller. Otherwise the controller bottleneck reduces performance. Generally, for large or small transfer sizes, parallel heads with replication gave better performance.} } @Article{nicastro:fft, author = {L. Nicastro and N. {D'Amico}}, title = {An optimized mass storage {FFT} for vector computers}, journal = {Parallel Computing}, year = {1995}, month = {March}, volume = {21}, pages = {423--432}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {out-of-core algorithm, parallel I/O algorithm, scientific computing, vector computer, pario-bib}, comment = {They describe an out-of-core FFT algorithm for vector computers (one disk, one vector processor). They implemented it on a Convex and show good performance. Basically, they segment the array, do FFTs on each segment, and do some transposing and other work to combine the segments. Each segment is basically a memoryload. Seems parallelizable too.} } @TechReport{nickolls:dpio, author = {John R. Nickolls and Ernie Rael}, title = {Data Parallel {Unix} Input/Output for a Massively Parallel Processor}, year = {1993}, number = {MP/P-17.93}, institution = {MasPar Computer Corporation}, keywords = {Unix, parallel I/O, data parallel, pario-bib}, comment = {Cite nickolls:maspar-io.} } @InProceedings{nickolls:maspar-io, author = {John R. Nickolls}, title = {The {MasPar} Scalable {Unix I/O} System}, booktitle = {Proceedings of the Eighth International Parallel Processing Symposium}, year = {1994}, month = {April}, pages = {390--394}, address = {Cancun, Mexico}, keywords = {parallel I/O, multiprocessor file system, SIMD, pario-bib}, abstract = {Scalable parallel computers require I/O balanced with computational power to solve data-intensive problems. Distributed memory architectures call for I/O hardware and software beyond those of conventional scalar systems. \par This paper introduces the MasPar I/O system, designed to provide balanced and scalable data-parallel Unix I/O. The architecture and implementation of the I/O hardware and software are described. Key elements include parallel access to conventional Unix file descriptors and a self-routing multistage network coupled with a buffer memory for flexible parallel I/O transfers. Performance measurements are presented for parallel Unix I/O with a scalable RAID disk array, a RAM disk, and a HIPPI interconnect.}, comment = {This provides the definitive reference for the MasPar parallel-I/O architecture and file system. This paper includes a brief discussion of the interface and performance results. Also includes some HIPPI interface performance results.
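[Editorial sketch, not the MasPar interface: the simplest form of parallel access to a conventional Unix file descriptor is every process issuing positioned reads into its own disjoint region of the shared file; the chunk size, file name, and rank argument below are hypothetical.]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* bytes per process; arbitrary */

int main(int argc, char **argv)
{
    int rank = (argc > 2) ? atoi(argv[2]) : 0;   /* process id, for example */
    int fd = open((argc > 1) ? argv[1] : "data.bin", O_RDONLY);
    char *buf = malloc(CHUNK);
    if (fd < 0 || buf == NULL)
        return 1;
    /* each process reads only its own region, so requests never overlap */
    ssize_t n = pread(fd, buf, CHUNK, (off_t)rank * CHUNK);
    printf("rank %d read %zd bytes\n", rank, n);
    free(buf);
    close(fd);
    return 0;
}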
This paper is the conference version of nickolls:dpio, so cite this one.} } @InProceedings{nieplocha:arrays, author = {Jarek Nieplocha and Ian Foster}, title = {Disk Resident Arrays: An Array-Oriented {I/O} Library for Out-Of-Core Computations}, booktitle = {Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation}, year = {1996}, month = {October}, pages = {196--204}, publisher = {IEEE Computer Society Press}, later = {foster:arrays}, keywords = {parallel I/O, pario-bib}, abstract = {In out-of-core computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However the tools used to manage this storage are typically quite different from those used to manage access to local and remote memory. This disparity complicates implementation of out-of-core algorithms and hinders portability. We describe a programming model that addresses this problem. This model allows parallel programs to use essentially the same mechanisms to manage the movement of data between any two adjacent levels in a hierarchical memory system. We take as our starting point the Global Arrays shared-memory model and library, which support a variety of operations on distributed arrays, including transfer between local and remote memories. We show how this model can be extended to support explicit transfer between global memory and secondary storage, and we define a Disk Resident Arrays Library that supports such transfers. We illustrate the utility of the resulting model with two applications, an out-of-core matrix multiplication and a large computational chemistry program. We also describe implementation techniques on several parallel computers and present experimental results that demonstrate that the Disk Resident Arrays model can be implemented very efficiently on parallel computers.} } @Article{nieplocha:chemio, author = {Jarek Nieplocha and Ian Foster and Rick Kendall}, title = {{ChemIO}: High-Performance Parallel {I/O} for Computational Chemistry Applications}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Fall}, volume = {12}, number = {3}, pages = {345--363}, earlier = {foster:chemio}, keywords = {parallel I/O application, pario-bib}, abstract = {Recent developments in I/O systems on scalable parallel computers have sparked renewed interest in out-of-core methods for computational chemistry. These methods can improve execution time significantly relative to "direct" methods, which perform many redundant computations. However, the widespread use of such out-of-core methods requires efficient and portable implementations of often complex I/O patterns. The ChemIO project has addressed this problem by defining an I/O interface that captures the I/O patterns found in important computational chemistry applications and by providing high-performance implementations of this interface on multiple platforms. 
This development not only broadens the user community for parallel I/O techniques but also provides new insights into the functionality required in general-purpose scalable I/O libraries and the techniques required to achieve high performance I/O on scalable parallel computers.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @InProceedings{nieplocha:distant, author = {Jarek Nieplocha and Ian Foster and Holger Dachsel}, title = {Distant {I/O}: One-Sided Access to Secondary Storage on Remote Processors}, booktitle = {Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing}, year = {1998}, month = {July}, pages = {148--154}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/proceedings/hpdc/8579/85790148abs.htm}, keywords = {parallel I/O, pario-bib, remote I/O}, abstract = {We propose a new parallel, noncollective I/O strategy called Distant I/O that targets clustered computer systems in which disks are attached to compute nodes. Distant I/O allows one-sided access to remote secondary storage without installing server processes or daemons on remote compute nodes. We implemented this model using Active Messages and demonstrated its performance advantages over the PIOFS parallel filesystem for an I/O-intensive parallel application on the IBM SP.} } @InProceedings{nieuwejaar:galley, author = {Nils Nieuwejaar and David Kotz}, title = {The {Galley} Parallel File System}, booktitle = {Proceedings of the 10th ACM International Conference on Supercomputing}, year = {1996}, month = {May}, pages = {374--381}, publisher = {ACM Press}, copyright = {ACM}, address = {Philadelphia}, later = {nieuwejaar:jgalley-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:galley.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:galley.pdf}, keywords = {parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk}, abstract = {As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.}, comment = {See also nieuwejaar:galley-perf. 
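[Editorial aside, not Galley's actual API: the richer request types that Galley and the nieuwejaar:strided entries below argue for can be approximated, inefficiently, on top of POSIX calls; a hypothetical helper that reads count records of reclen bytes spaced stride bytes apart. A real strided interface hands the whole pattern to the file system in a single request instead of issuing count separate reads.]

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical strided read: the i-th record starts at offset
   first + i*stride and is reclen bytes long. */
ssize_t read_strided(int fd, char *buf, size_t reclen,
                     off_t first, off_t stride, int count)
{
    ssize_t total = 0;
    int i;
    for (i = 0; i < count; i++) {
        ssize_t n = pread(fd, buf + (size_t)i * reclen, reclen,
                          first + (off_t)i * stride);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}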
Also available at http://www.acm.org/pubs/citations/proceedings/supercomputing/237578/p374-nieuwejaar/} } @InProceedings{nieuwejaar:galley-perf, author = {Nils Nieuwejaar and David Kotz}, title = {Performance of the {Galley} Parallel File System}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {83--94}, publisher = {ACM Press}, copyright = {ACM}, address = {Philadelphia}, later = {nieuwejaar:jgalley-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:galley-perf.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:galley-perf.pdf}, keywords = {parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk}, abstract = {As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. Initial experiments, reported in this paper, indicate that Galley is capable of providing high-performance I/O to applications that access data in patterns that have been observed to be common.}, comment = {See also nieuwejaar:galley.} } @Article{nieuwejaar:jgalley, author = {Nils Nieuwejaar and David Kotz}, title = {The {Galley} Parallel File System}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {447--476}, publisher = {North-Holland (Elsevier Scientific)}, copyright = {North-Holland (Elsevier Scientific)}, earlier = {nieuwejaar:jgalley-tr}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:jgalley.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:jgalley.pdf}, keywords = {parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk}, abstract = {Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. 
We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.}, comment = {A revised version of nieuwejaar:jgalley-tr, which is a combination of nieuwejaar:galley and nieuwejaar:galley-perf.} } @TechReport{nieuwejaar:jgalley-tr, author = {Nils Nieuwejaar and David Kotz}, title = {The {Galley} Parallel File System}, year = {1996}, month = {May}, number = {PCS-TR96-286}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, earlier = {nieuwejaar:galley}, later = {nieuwejaar:jgalley}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/133/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:jgalley-tr.pdf}, keywords = {parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk}, abstract = {Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.} } @TechReport{nieuwejaar:strided, author = {Nils Nieuwejaar and David Kotz}, title = {A Multiprocessor Extension to the Conventional File System Interface}, year = {1994}, month = {September}, number = {PCS-TR94-230}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, later = {nieuwejaar:strided2-tr}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/103/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:strided.pdf}, keywords = {parallel I/O, multiprocessor file system, pario-bib, dfk}, abstract = {As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. 
Since the conventional interface does not provide an efficient method of describing these patterns, we present an extension which supports {\em strided} and {\em nested-strided} I/O requests.} } @InProceedings{nieuwejaar:strided2, author = {Nils Nieuwejaar and David Kotz}, title = {Low-level Interfaces for High-level Parallel {I/O}}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {47--62}, copyright = {the authors}, identical = {nieuwejaar:strided2-tr}, later = {nieuwejaar:strided2-book}, keywords = {parallel I/O, multiprocessor file system, pario-bib, dfk}, abstract = {As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present three extensions to the interface that support {\em strided}, {\em nested-strided}, and {\em nested-batched} I/O requests. We show how these extensions can be used to express common access patterns.}, comment = {Identical to revised TR95-253, nieuwejaar:strided2-tr. Cite nieuwejaar:strided2-book.} } @InCollection{nieuwejaar:strided2-book, author = {Nils Nieuwejaar and David Kotz}, title = {Low-level Interfaces for High-level Parallel {I/O}}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {9}, editor = {Ravi Jain and John Werth and James C. Browne}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {205--223}, publisher = {Kluwer Academic Publishers}, copyright = {Kluwer Academic Publishers}, earlier = {nieuwejaar:strided2}, keywords = {parallel I/O, multiprocessor file system, pario-bib, dfk}, abstract = {As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present three extensions to the interface that support {\em strided}, {\em nested-strided}, and {\em nested-batched} I/O requests. We show how these extensions can be used to express common access patterns.}, comment = {Part of a whole book on parallel I/O; see iopads-book and nieuwejaar:strided2 (which is not much different).} } @TechReport{nieuwejaar:strided2-tr, author = {Nils Nieuwejaar and David Kotz}, title = {Low-level Interfaces for High-level Parallel {I/O}}, year = {1995}, month = {March}, number = {PCS-TR95-253}, institution = {Dept. 
of Computer Science, Dartmouth College}, copyright = {the authors}, note = {Revised 4/18/95 and appeared in IOPADS workshop at IPPS~'95}, identical = {nieuwejaar:strided2}, earlier = {nieuwejaar:strided}, URL = {https://digitalcommons.dartmouth.edu/facoa/3338/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:strided2-tr.pdf}, keywords = {parallel I/O, multiprocessor file system, pario-bib, dfk}, abstract = {As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. By tracing all the activity of a parallel file system in a production, scientific computing environment, we show that many applications exhibit highly regular, but non-consecutive I/O access patterns. Since the conventional interface does not provide an efficient method of describing these patterns, we present three extensions to the interface that support {\em strided}, {\em nested-strided}, and {\em nested-batched} I/O requests. We show how these extensions can be used to express common access patterns.}, comment = {After revision, identical to nieuwejaar:strided2.} } @PhdThesis{nieuwejaar:thesis, author = {Nils A. Nieuwejaar}, title = {Galley: A New Parallel File System for Parallel Applications}, year = {1996}, month = {November}, school = {Dept. of Computer Science, Dartmouth College}, copyright = {Nils A. Nieuwejaar}, note = {Available as Dartmouth Technical Report PCS-TR96-300}, URL = {ftp://ftp.cs.dartmouth.edu/TR/TR96-300.ps.Z}, keywords = {parallel I/O, multiprocessor file system, file system workload characterization, file access patterns, file system interface, pario-bib}, abstract = {Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Most multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access those multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated application and library programmers to use knowledge about their I/O to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. \par In this work we examine current multiprocessor file systems, as well as how those file systems are used by scientific applications. Contrary to the expectations of the designers of current parallel file systems, the workloads on those systems are dominated by requests to read and write small pieces of data. Furthermore, rather than being accessed sequentially and contiguously, as in uniprocessor and supercomputer workloads, files in multiprocessor file systems are accessed in regular, structured, but non-contiguous patterns. \par Based on our observations of multiprocessor workloads, we have designed Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. In this work, we introduce Galley and discuss its design and implementation. 
We describe Galley's new three-dimensional file structure and discuss how that structure can be used by parallel applications to achieve higher performance. We introduce several new data-access interfaces, which allow applications to explicitly describe the regular access patterns we found to be common in parallel file system workloads. We show how these new interfaces allow parallel applications to achieve tremendous increases in I/O performance. Finally, we discuss how Galley's new file structure and data-access interfaces can be useful in practice.} } @Article{nieuwejaar:workload, author = {Nils Nieuwejaar and David Kotz and Apratim Purakayastha and Carla Schlatter Ellis and Michael Best}, title = {File-Access Characteristics of Parallel Scientific Workloads}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1996}, month = {October}, volume = {7}, number = {10}, pages = {1075--1089}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, earlier = {nieuwejaar:workload-tr}, URL = {http://www.computer.org/tpds/td1996/l1075abs.htm}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:workload.pdf}, keywords = {parallel I/O, file system workload, workload characterization, file access pattern, multiprocessor file system, dfk, pario-bib}, abstract = {Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems. \par The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide multiprocessor file-system design.}, comment = {See also kotz:workload, nieuwejaar:strided, ap:workload.} } @TechReport{nieuwejaar:workload-tr, author = {Nils Nieuwejaar and David Kotz and Apratim Purakayastha and Carla Schlatter Ellis and Michael Best}, title = {File-Access Characteristics of Parallel Scientific Workloads}, year = {1995}, month = {August}, number = {PCS-TR95-263}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, earlier = {kotz:workload}, later = {nieuwejaar:workload}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/237/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/nieuwejaar:workload-tr.pdf}, keywords = {parallel I/O, file system workload, workload characterization, file access pattern, multiprocessor file system, dfk, pario-bib}, abstract = {Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. 
This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of parallel file systems. \par The design of a high-performance parallel file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of parallel file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide parallel file-system design.}, comment = {See also nieuwejaar:strided, ap:workload.} } @Article{ninghui:pfs, author = {Sun Ninghui}, title = {The design of parallel file systems}, journal = {Chinese Journal of Computers}, year = {1994}, month = {December}, volume = {17}, number = {12}, pages = {938--945}, note = {In Chinese}, keywords = {parallel file systems, parallel I/O, pario-bib}, comment = {From the abstract, it doesn't appear to offer anything new, but it's hard to tell.} } @InProceedings{nishino:sfs, author = {H. Nishino and S. Naka and K. Ikumi}, title = {High Performance File System for Supercomputing Environment}, booktitle = {Proceedings of Supercomputing '89}, year = {1989}, pages = {747--756}, keywords = {supercomputer, file system, parallel I/O, pario-bib}, comment = {A modification to the Unix file system to allow for supercomputer access. Workload: file sizes from a few KB to a few GB, I/O operation sizes from a few bytes to hundreds of MB. Generally programs split into I/O-bound and CPU-bound parts. Sequential and random access. Needs: giant files (bigger than device), peak hardware performance for large files, NFS access. Their FS is built into Unix ``transparently''. Space allocated in clusters, rather than blocks; clusters might be as big as a cylinder. Allows for efficient, large files. Mentions parallel disks as part of a ``virtual volume'' but does not elaborate.
Prefetching within a cluster.} } @InCollection{nitzberg:bcollective, author = {Bill Nitzberg and Virginia Lo}, title = {Collective Buffering: Improving Parallel {I/O} Performance}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {19}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {271--281}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {nitzberg:collective}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, collective I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of nitzberg:collective.} } @TechReport{nitzberg:cfs, author = {Bill Nitzberg}, title = {Performance of the {iPSC/860 Concurrent File System}}, year = {1992}, month = {December}, number = {RND-92-020}, institution = {NAS Systems Division, NASA Ames}, later = {krystynak:pario}, URL = {http://www.nas.nasa.gov/NAS/TechReports/RNDreports/RND-92-020/RND-92-020.html}, keywords = {Intel, parallel file system, performance measurement, parallel I/O, pario-bib}, abstract = {Typical scientific applications require vast amounts of processing power coupled with significant I/O capacity. Highly parallel computer systems can provide processing power at low cost, but tend to lack I/O capacity. By evaluating the performance and scalability of the Intel iPSC/860 Concurrent File System (CFS), we can get an idea of the current state of parallel I/O performance. I ran three types of tests on the iPSC/860 system at the Numerical Aerodynamic Simulation facility (NAS): broadcast, simulating initial data loading; partitioned, simulating reading and writing a one-dimensional decomposition; and interleaved, simulating reading and writing a two-dimensional decomposition. \par The CFS at NAS can sustain up to 7 megabytes per second writing and 8 megabytes per second reading. However, due to the limited disk cache size, partitioned read performance sharply drops to less than 1 megabyte per second on 128 nodes. In addition, interleaved read and write performance show a similar drop in performance for small block sizes. Although the CFS can sustain 70-80\% of peak I/O throughput, the I/O performance does not scale with the number of nodes. \par Obtaining maximum performance may require significant programming effort: pre-allocating files, overlapping computation and I/O, using large block sizes, and limiting I/O parallelism. A better approach would be to attack the problem by either fixing the CFS (e.g., add more cache to the I/O nodes), or hiding its idiosyncrasies (e.g., implement a parallel I/O library).}, comment = {Straightforward measurements of an iPSC/860 with 128 compute nodes, 10 I/O nodes, and 10 disks. This is a bigger system than has been measured before. Has some basic MB/s measurements for some features in Tables 1--2. CFS bug prevents more than 2 asynch requests at a time. Another bug forced random-writes to use preallocated files. For a low number of procs, they weren't able to pull the full disk bandwidth. Cache thrashing caused problems when they had a large number of procs, because each read prefetched 8 blocks, which were flushed by some other proc doing a subsequent read. Workaround by synchronizing procs to limit concurrency.
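[Editorial sketch of that style of workaround, not their code: readers take turns in small groups, with a barrier between phases, so the I/O-node cache sees fewer competing prefetch streams; the group size, chunk size, and file name are made-up tuning knobs, and generic POSIX reads stand in for the CFS calls.]

#include <mpi.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (256 * 1024)   /* bytes per process; arbitrary */
#define GROUP 8              /* processes allowed to read at once; arbitrary */

int main(int argc, char **argv)
{
    int rank, nprocs, phase, nphases, fd;
    char *buf = malloc(CHUNK);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    fd = open((argc > 1) ? argv[1] : "data.bin", O_RDONLY);
    nphases = (nprocs + GROUP - 1) / GROUP;

    /* only GROUP ranks issue their read in any given phase */
    for (phase = 0; phase < nphases; phase++) {
        if (rank / GROUP == phase && fd >= 0 && buf != NULL)
            pread(fd, buf, CHUNK, (off_t)rank * CHUNK);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (fd >= 0)
        close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}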
Increasing cache size is the right answer, but is not scalable.} } @InProceedings{nitzberg:collective, author = {Bill Nitzberg and Virginia Lo}, title = {Collective Buffering: Improving Parallel {I/O} Performance}, booktitle = {Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing}, year = {1997}, month = {August}, pages = {148--157}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, later = {nitzberg:bcollective}, keywords = {parallel I/O, collective I/O, pario-bib}, abstract = {"Parallel I/O" is the support of a single parallel application run on many nodes; application data is distributed among the nodes, and is read or written to a single logical file, itself spread across nodes and disks. Parallel I/O is a mapping problem from the data layout in node memory to the file layout on disks. Since the mapping can be quite complicated and involve significant data movement, optimizing the mapping is critical for performance. \par We discuss our general model of the problem, describe four Collective Buffering algorithms we designed, and report experiments testing their performance on an Intel Paragon and an IBM SP2 both housed at NASA Ames Research Center. Our experiments show improvements of up to two orders of magnitude over standard techniques and the potential to deliver peak performance with minimal hardware support.} } @TechReport{nitzberg:sc94tutorial, author = {Bill Nitzberg and Samuel A. Fineberg}, title = {Parallel {I/O} on Highly Parallel Systems---Supercomputing '94 Tutorial {M11} Notes}, year = {1994}, month = {November}, number = {NAS-94-005}, institution = {NASA Ames Research Center}, later = {nitzberg:sc95tutorial}, URL = {http://www.nas.nasa.gov/NAS/TechReports/NASreports/NAS-94-005/NAS-94-005.html}, keywords = {parallel I/O, tutorial, pario-bib}, abstract = {Typical scientific applications require vast amounts of processing power coupled with significant I/O capacity. Highly parallel computer systems provide floating point processing power at low cost, but efficiently supporting a scientific workload also requires commensurate I/O performance. In order to achieve high I/O performance, these systems utilize parallelism in their I/O subsystems---supporting concurrent access to files by multiple nodes of a parallel application, and striping files across multiple disks. However, obtaining maximum I/O performance can require significant programming effort. \par This tutorial presents a snapshot of the state of I/O on highly parallel systems by comparing the well-balanced I/O performance of a traditional vector supercomputer (the Cray Y/MP C90) with the I/O performance of various highly parallel systems (Cray T3D, IBM SP-2, Intel iPSC/860 and Paragon, and Thinking Machines CM-5). In addition, the tutorial covers benchmarking techniques for evaluating current parallel I/O systems and techniques for improving parallel I/O performance. Finally, the tutorial presents several high level parallel I/O libraries and shows how they can help application programmers improve I/O performance.} } @TechReport{nitzberg:sc95tutorial, author = {Bill Nitzberg and Samuel A.
Fineberg}, title = {Parallel {I/O} on Highly Parallel Systems---Supercomputing '95 Tutorial {M6} Notes}, year = {1995}, month = {December}, number = {NAS-95-022}, institution = {NASA Ames Research Center}, earlier = {nitzberg:sc94tutorial}, URL = {http://www.nas.nasa.gov/NAS/TechReports/NASreports/NAS-95-022/NAS-95-022.html}, keywords = {parallel I/O, tutorial, pario-bib}, abstract = {Typical scientific applications require vast amounts of processing power coupled with significant I/O capacity. Highly parallel computer systems provide floating-point processing power at low cost, but efficiently supporting a scientific workload also requires commensurate I/O performance. To achieve high I/O performance, these systems use parallelism in their I/O subsystems, supporting concurrent access to files by multiple nodes of a parallel application and striping files across multiple disks. However, obtaining maximum I/O performance can require significant programming effort. This tutorial will present a comprehensive survey of the state of the art in parallel I/O from basic concepts to recent advances in the research community. Requirements, interfaces, architectures, and performance will be illustrated using concrete examples from commercial offerings (Cray T3D, IBM SP-2, Intel Paragon, Meiko CS-2, and workstation clusters) and academic research projects (MPI-IO, Panda, PASSION, PIOUS, and Vesta). The material covered is roughly 30\% beginner, 60\% intermediate, and 10\% advanced.} } @PhdThesis{nitzberg:thesis, author = {William J. Nitzberg}, title = {Collective Parallel {I/O}}, year = {1995}, month = {December}, school = {Department of Computer and Information Science, University of Oregon}, keywords = {parallel I/O, parallel algorithm, file system interface, pario-bib}, abstract = {Parallel I/O, the process of transferring a global data structure distributed among compute nodes to a file striped across storage devices, can be quite complicated and involve a significant amount of data movement. Optimizing parallel I/O with respect to data distribution, file layout, and machine architecture is critical for performance. In this work, we propose a solution to both the performance and portability problems plaguing the wide acceptance of distributed memory parallel computers for scientific computing: a collective parallel I/O interface and efficient algorithms to implement it. A collective interface allows the programmer to specify a file access as a high-level global operation rather than as a series of seeks and writes. This not only provides a more natural interface for the programmer, but also provides the system with both the opportunity and the semantic information necessary to optimize the file operation. \par We attack this problem in three steps: we evaluate an early parallel I/O system, the Intel iPSC/860 Concurrent File System; we design and analyze the performance of two classes of algorithms taking advantage of collective parallel I/O; and we design MPI-IO, a collective parallel I/O interface likely to become the standard for portable parallel I/O. \par The collective I/O algorithms fall into two broad categories: data block scheduling and collective buffering. Data block scheduling algorithms attempt to schedule the individual data transfers to minimize resource contention and to optimize for particular hardware characteristics. We develop and evaluate three data block scheduling algorithms: Grouping, Random, and Sliding Window.
The data block scheduling algorithms improved performance by as much as a factor of eight. The collective buffering algorithms permute the data before writing or after reading in order to combine small file accesses into large blocks. We design and test a series of four collective buffering algorithms and demonstrate improvement in performance by two orders of magnitude over naive file I/O for the hardest, three-dimensional distributions.}, comment = {See also nitzberg:cfs and corbett:mpi-overview.} } @InProceedings{no:file-db, author = {Jaechun No and Rajeev Thakur and Alok Choudhary}, title = {Integrating Parallel File {I/O} and Database Support for High-Performance Scientific Data Management}, booktitle = {Proceedings of SC2000: High Performance Networking and Computing}, year = {2000}, month = {November}, publisher = {IEEE Computer Society Press}, address = {Dallas, TX}, note = {To appear}, URL = {http://www.mcs.anl.gov/~thakur/papers/sdm.ps}, keywords = {scientific computing, database, parallel I/O, pario-bib}, abstract = {Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions are used for this problem: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that aims to combine the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. SDM takes advantage of various I/O optimizations available in MPI-IO, such as collective I/O and noncontiguous requests, in a manner that is transparent to the user. As a result, users can write and retrieve data with the performance of parallel file I/O, without having to bother with the details of actually performing file I/O. \par In this paper, we describe the design and implementation of SDM. With the help of two parallel application templates, ASTRO3D and an Euler solver, we illustrate how some of the design criteria affect performance.} } @InProceedings{no:irregular-io, author = {Jaechun No and Sung-soon Park and Jesus Carretero and Alok Choudhary and Pang Chen}, title = {Design and Implementation of a Parallel {I/O} Runtime System for Irregular Applications}, booktitle = {Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing}, year = {1998}, month = {March}, pages = {280--284}, publisher = {IEEE Computer Society Press}, later = {no:jirregular}, URL = {http://computer.org/proceedings/ipps/8403/8403toc.htm}, keywords = {parallel I/O, pario-bib}, comment = {see no:irregular2 and no:irregular.} } @InProceedings{no:irregular2, author = {Jaechun No and J. 
Carretero and Alok Choudhary}, title = {High performance parallel {I/O} schemes for irregular applications on clusters of workstations}, booktitle = {Proceedings of the Seventh High-Performance Computing and Networking Conference}, year = {1999}, pages = {1117--1126}, earlier = {no:irregular-io}, keywords = {parallel I/O, irregular applications, pario-bib}, abstract = {Due to the convergence of the fast microprocessors with low latency and high bandwidth communication networks, clusters of workstations are being used for high-performance computing. In this paper we present the design and implementation of a runtime system to support irregular applications on clusters of workstations, called "Collective I/O Clustering". The system provides a friendly programming model for performing I/O in irregular applications on clusters of workstations, and is completely integrated with the underlying communication and I/O system. All the performance results were obtained on the IBM-SP machine, located at Argonne National Labs} } @Article{no:jirregular, author = {Jaechun No and Jesus Carretero and Sung-soon Park and Alok Choudhary and Pang Chen}, title = {Design and Implementation of a Parallel {I/O} Runtime System for Irregular Applications}, journal = {Journal of Parallel and Distributed Computing}, year = {2002}, month = {February}, volume = {62}, number = {2}, pages = {193--220}, publisher = {Elsevier Science}, earlier = {no:irregular-io}, URL = {http://www.ece.nwu.edu/~jno/PAPER/jpdc.ps}, keywords = {parallel I/O, pario-bib} } @InProceedings{nodine:deterministic, author = {M. H. Nodine and J. S. Vitter}, title = {Deterministic Distribution Sort in Shared and Distributed Memory Multiprocessors}, booktitle = {Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures}, year = {1993}, pages = {120--129}, address = {Velen, Germany}, keywords = {parallel I/O algorithm, sorting, shared memory, pario-bib}, abstract = {We present an elegant deterministic load balancing strategy for distribution sort that is applicable to a wide variety of parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm for external sorting with multiple disks and parallel processors. In each input/output (I/O) operation, each of the $D \geq 1$ disks can simultaneously transfer a block of $B$ contiguous records. Our two measures of performance are the number of I/Os and the amount of work done by the CPU(s); our algorithm is simultaneously optimal for both measures. We also show how to sort deterministically in parallel memory hierarchies. When the processors are interconnected by any sort of a PRAM, our algorithms are optimal for all parallel memory hierarchies; when the interconnection network is a hypercube, our algorithms are either optimal or best-known.}, comment = {Short version of nodine:sort2 and nodine:sortdisk.} } @TechReport{nodine:greed, author = {Mark H. Nodine and Jeffrey Scott Vitter}, title = {Greed Sort: An Optimal External Sorting Algorithm for Multiple Disks}, year = {1992}, number = {CS--91--20}, institution = {Brown University}, note = {A summary appears in SPAA~'91}, URL = {http://www.cs.brown.edu/publications/techreports/reports/CS-91-20.html}, keywords = {parallel I/O algorithms, sorting, pario-bib}, abstract = {We present an optimal deterministic algorithm for external sorting on multiple disks. Our measure of performance is the number of input/output (I/O) operations. 
In each I/O, each disk can simultaneously transfer a block of data. Our algorithm improves upon a recent randomized optimal algorithm and the (non-optimal) commonly used technique of disk striping. The code is simple enough for easy implementation.}, comment = {Summary is nodine:sort. This is a revision of CS--91--04.} } @InProceedings{nodine:loadbalance, author = {Mark H. Nodine and Jeffrey Vitter}, title = {Load Balancing Paradigms for Optimal Use of Parallel Disks and Parallel Memory Hierarchies}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {26--39}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, keywords = {parallel I/O algorithm, memory hierarchy, load balance, sorting, pario-bib}, abstract = {We present several load balancing paradigms pertinent to optimizing I/O performance with disk and processor parallelism. We use sorting as our canonical application to illustrate the paradigms, and we survey a wide variety of applications in computational geometry. The use of parallel disks can help overcome the I/O bottleneck in sorting if the records in each read or write are evenly balanced among the disks. There are three known load balancing paradigms that lead to optimal I/O algorithms: using randomness to assign blocks to disks, using the disks predominantly independently, and deterministically balancing the blocks by matching. In this report, we describe all of these techniques in detail and compare their relative advantages. We show how randomized and deterministic balancing can be extended to provide sorting algorithms that are optimal both in terms of the number of I/Os and the internal processing time for parallel-processing machines with scalable I/O subsystems and with parallel memory hierarchies. We also survey results achieving optimal performance in these models for a large range of online and batch problems in computational geometry.}, comment = {Invited speaker: Jeffrey Vitter.} } @InProceedings{nodine:opt-sort, author = {Mark H. Nodine and Jeffrey Scott Vitter}, title = {Paradigms for Optimal Sorting with Multiple Disks}, booktitle = {Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences}, year = {1993}, volume = {I}, pages = {50--59}, keywords = {parallel I/O algorithms, sorting, pario-bib}, comment = {They compare three techniques for balancing I/O across parallel disks, using sorting as an example. The three are randomization, using the disks independently, and tricky matching techniques as in balance sort. They also look at parallel memory hierarchies. All in all, it seems to be mostly a survey of techniques in earlier papers.} } @InProceedings{nodine:sort, author = {Mark H. Nodine and Jeffrey Scott Vitter}, title = {Large-Scale Sorting in Parallel Memories}, booktitle = {Proceedings of the Third Symposium on Parallel Algorithms and Architectures}, year = {1991}, pages = {29--39}, keywords = {external sorting, file access pattern, parallel I/O, pario-bib}, comment = {Describes algorithms for external sorting that are optimal in the number of I/Os. Proposes a couple of fairly realistic memory hierarchy models. See also the journal version, vitter:uniform.} } @TechReport{nodine:sort2, author = {Mark H.
Nodine and Jeffrey Scott Vitter}, title = {Optimal Deterministic Sorting in Parallel Memory Hierarchies}, year = {1992}, month = {August}, number = {CS--92--38}, institution = {Brown University}, URL = {ftp://ftp.cs.brown.edu/pub/techreports/92/cs92-38.ps.Z}, keywords = {parallel I/O algorithms, parallel memory, sorting, pario-bib}, comment = {see nodine:deterministic.} } @TechReport{nodine:sortdisk, author = {Mark H. Nodine and Jeffrey Scott Vitter}, title = {Optimal Deterministic Sorting on Parallel Disks}, year = {1992}, month = {August}, number = {CS--92--08}, institution = {Brown University}, URL = {ftp://ftp.cs.brown.edu/pub/techreports/92/cs92-08.ps.Z}, keywords = {parallel I/O algorithms, sorting, pario-bib}, comment = {see nodine:deterministic.} } @InProceedings{nurmi:atm, author = {Marc A. Nurmi and William E. Bejcek and Rod N. Gregoire and K. C. Liu and Mark D. Pohl}, title = {Automatic Management of {CPU} and {I/O} Bottlenecks in Distributed Applications on {ATM} Networks}, booktitle = {Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing}, year = {1996}, month = {August}, pages = {481--489}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, ATM, parallel networking, pario-bib}, abstract = {Existing parallel programming environments for networks of workstations improve the performance of computationally intensive applications by using message passing or virtual shared memory to alleviate CPU bottlenecks. This paper describes an approach based on message passing that addresses both CPU and I/O bottlenecks for a specific class of distributed applications on ATM networks. ATM provides the bandwidth required to utilize multiple I/O channels in parallel. This paper also describes an environment based on distributed process management and centralized application management that implements the approach. The environment adds processes to a running application when necessary to alleviate CPU and I/O bottlenecks while managing process connections in a manner that is transparent to the application.} } @TechReport{ober:seismic, author = {Curtis Ober and Ron Oldfield and John VanDyke and David Womble}, title = {Seismic Imaging on Massively Parallel Computers}, year = {1996}, month = {April}, number = {SAND96-1112}, institution = {Sandia National Laboratories}, URL = {ftp://ftp.cs.sandia.gov/pub/papers/dewombl/seismic_imaging_mpp.ps.Z}, keywords = {multiprocessor application, scientific computing, seismic data processing, parallel I/O, pario-bib}, abstract = {Fast, accurate imaging of complex, oil-bearing geologies, such as overthrusts and salt domes, is the key to reducing the costs of domestic oil and gas exploration. Geophysicists say that the known oil reserves in the Gulf of Mexico could be significantly increased if accurate seismic imaging beneath salt domes was possible. A range of techniques exist for imaging these regions, but the highly accurate techniques involve the solution of the wave equation and are characterized by large data sets and large computational demands. Massively parallel computers can provide the computational power for these highly accurate imaging techniques. \par A brief introduction to seismic processing will be presented, and the implementation of a seismic-imaging code for distributed memory computers will be discussed. The portable code, Salvo, performs a wave-equation-based, 3-D, prestack, depth imaging and currently runs on the Intel Paragon, the Cray T3D and SGI Challenge series. 
It uses MPI for portability, and has sustained 22 Mflops/sec/proc (compiled FORTRAN) on the Intel Paragon.}, comment = {2 pages about their I/O scheme, mostly regarding a calculation of the optimal balance between compute nodes and I/O nodes to achieve perfect overlap.} } @InProceedings{ober:seismic2, author = {Curtis Ober and Ron Oldfield and David Womble and John VanDyke and Sudip Dosanjh}, title = {Seismic imaging on massively parallel computers}, booktitle = {Proceedings of the 1996 Simulations Multiconference}, year = {1996}, month = {April}, URL = {ftp://ftp.cs.dartmouth.edu/pub/raoldfi/salvo/smc96.ps.gz}, keywords = {parallel application, scientific computing, seismic data processing, parallel I/O, pario-bib, oldfield} } @InProceedings{ober:seismic3, author = {Curtis Ober and Ron Oldfield and David Womble and L. Romero and Charles Burch}, title = {Practical aspects of prestack depth migration with finite differences}, booktitle = {Proceedings of the 67th Annual International Meeting of the Society of Exploration Geophysicists}, year = {1997}, month = {November}, pages = {1758--1761}, address = {Dallas Texas}, note = {Expanded Abstracts}, keywords = {parallel application, scientific computing, seismic data processing, parallel I/O, pario-bib, oldfield} } @TechReport{oed:t3d, author = {Wilfried Oed}, title = {The {Cray Research} Massively Parallel Processor System {CRAY T3D}}, year = {1993}, month = {November 15}, institution = {Cray Research GmbH}, address = {M\"unchen, Germany}, keywords = {parallel architecture, shared memory, supercomputer, parallel I/O, pario-bib}, comment = {A MIMD, shared-memory machine, with 2-processor units embedded in a 3-d torus. Each link is bidirectional and runs 300 MB/s. Processors are 150 MHz ALPHA, plus 16--64 MB RAM, plus a memory interface unit. Global physical address space with remote-reference and block-transfer capability. Not clear about cache coherency. Separate tree network for global synchronization. Support for message send and optional interrupt. I/O is all done through interface nodes that hook to the YMP host and to its I/O clusters with 400 MB/s links. I/O is by default serialized, but they do support a ``broadcast'' read operation (but see pase:t3d-fortran). FORTRAN compiler supports the NUMA shared memory; PVM is used for C and message passing.} } @TechReport{ogata:diskarray, author = {Mikito Ogata and Michael J. Flynn}, title = {A Queueing Analysis for Disk Array Systems}, year = {1990}, number = {CSL-TR-90-443}, institution = {Stanford University}, keywords = {disk array, performance analysis, pario-bib}, comment = {Fairly complex analysis of a multiprocessor attached to a disk array system through a central server that is the buffer. Assumes task-oriented model for parallel system, where tasks can be assigned to any CPU; this makes for an easy model. Like Reddy, they compare declustering and striping (they call them striped and synchronized disks).} } @InProceedings{okeefe:fibre, author = {Matthew T. 
O'Keefe}, title = {Shared File Systems and {Fibre Channel}}, booktitle = {Proceedings of the Sixth NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1998}, month = {March}, pages = {1--16}, publisher = {IEEE Computer Society Press}, address = {College Park, MD}, URL = {http://gfs.lcse.umn.edu/pubs/shared_file_systems_1.0.pdf}, keywords = {distributed file system, data storage, mass storage, network-attached disks, Fibre Channel, pario-bib}, comment = {position paper} } @TechReport{oldfield:app-pario, author = {Ron Oldfield and David Kotz}, title = {Applications of Parallel {I/O}}, year = {1998}, month = {August}, number = {PCS-TR98-337}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, note = {Supplement to PCS-TR96-297}, earlier = {kotz:app-pario}, later = {oldfield:bapp-pario}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/163/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/oldfield:app-pario.pdf}, keywords = {parallel I/O application, file access patterns, pario-bib, dfk}, abstract = {Scientific applications are increasingly being implemented on massively parallel supercomputers. Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O. It will be updated periodically.} } @InProceedings{oldfield:armada, author = {Ron Oldfield and David Kotz}, title = {{Armada}: A parallel file system for computational grids}, booktitle = {Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2001}, month = {May}, pages = {194--201}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, address = {Brisbane, Australia}, URL = {http://www.cs.dartmouth.edu/~dfk/papers/oldfield:armada.ps.gz}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/oldfield:armada.pdf}, keywords = {parallel I/O, Grid, parallel file system, pario-bib}, abstract = {High-performance distributed computing appears to be shifting away from tightly-connected supercomputers to computational grids composed of heterogeneous systems of networks, computers, storage devices, and various other devices that collectively act as a single geographically distributed virtual computer. One of the great challenges for this environment is providing efficient parallel data access to remote distributed datasets. In this paper, we discuss some of the issues associated with parallel I/O and computational grids and describe the design of a flexible parallel file system that allows the application to control the behavior and functionality of virtually all aspects of the file system.}, comment = {Named one of two ``best'' papers in the Grid category.} } @InCollection{oldfield:bapp-pario, author = {Ron Oldfield and David Kotz}, title = {Scientific Applications using Parallel {I/O}}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {45}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {655--666}, publisher = {IEEE Computer Society Press and John Wiley \& Sons}, copyright = {IEEE}, earlier = {oldfield:app-pario}, keywords = {parallel I/O application, file access patterns, pario-bib, dfk}, abstract = {Scientific applications are increasingly being implemented on massively parallel supercomputers.
Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O.}, comment = {Part of jin:io-book.} } @TechReport{oldfield:emulab-tr, author = {Ron Oldfield and David Kotz}, title = {Using the {Emulab} network testbed to evaluate the {Armada} {I/O} framework for computational grids}, year = {2002}, month = {September}, number = {TR2002-433}, institution = {Dept. of Computer Science, Dartmouth College}, address = {Hanover, NH}, URL = {https://digitalcommons.dartmouth.edu/cs_tr/201/}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/oldfield:emulab-tr.pdf}, keywords = {emulab, network emulation, Armada, performance, dfk, pario-bib}, abstract = {This short report describes our experiences using the Emulab network testbed at the University of Utah to test performance of the Armada framework for parallel I/O on computational grids.} } @Article{oldfield:restruct, author = {Ron Oldfield and David Kotz}, title = {Improving data access for computational grid applications}, journal = {Cluster Computing, The Journal of Networks, Software Tools and Applications}, year = {2004}, copyright = {the authors}, note = {Accepted for publication}, keywords = {parallel I/O, Grid computing, distributed computing, graph algorithms, pario-bib}, abstract = {High-performance computing increasingly occurs on ``computational grids'' composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single ``virtual'' computer. A key challenge in this environment is to provide efficient access to data distributed across remote data servers. Our parallel I/O framework, called Armada, allows application and data-set providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before computation. Although the framework provides a simple programming model for the application programmer and the data-set provider, the resulting graph may contain bottlenecks that prevent efficient data access. In this paper, we present an algorithm used to restructure Armada graphs that distributes computation and data flow to improve performance in the context of a wide-area computational grid.} } @Article{oldfield:seismic, author = {Ron A. Oldfield and David E. Womble and Curtis C. Ober}, title = {Efficient Parallel {I/O} in Seismic Imaging}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Fall}, volume = {12}, number = {3}, pages = {333--344}, publisher = {Sage Science Press}, copyright = {Sage Publications}, URL = {ftp://ftp.cs.dartmouth.edu/pub/raoldfi/salvo/salvoIO.ps.gz}, keywords = {parallel I/O application, pario-bib}, abstract = {While high performance computers tend to be measured by their processor and communications speeds, the bottleneck for many large-scale applications is the I/O performance rather than the computational or communication performance. One such application is the processing of 3D seismic data. Seismic data sets, consisting of recorded pressure waves, can be very large, sometimes more than a terabyte in size. Even if the computations can be performed in-core, the time required to read the initial seismic data and velocity model and write images is substantial. 
This paper will discuss our approach in handling the massive I/O requirements of seismic processing and show the performance of our imaging code (Salvo) on the Intel Paragon.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @PhdThesis{oldfield:thesis, author = {Ron Oldfield}, title = {Efficient {I/O} for Computational Grid Applications}, year = {2003}, month = {May}, school = {Dept. of Computer Science, Dartmouth College}, note = {Available as Dartmouth Computer Science Technical Report TR2003-459.}, later = {oldfield:thesis-tr}, keywords = {parallel I/O, Grid computing, pario-bib}, abstract = {High-performance computing increasingly occurs on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual" computer. A key challenge in this environment is to provide efficient access to data distributed across remote data servers. This dissertation explores some of the issues associated with I/O for wide-area distributed computing and describes an I/O system, called Armada, with the following features: a framework to allow application and dataset providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before or after computation; an algorithm to restructure application graphs to increase parallelism and to improve network performance in a wide-area network; and a hierarchical graph-partitioning scheme that deploys components of the application graph in a way that is both beneficial to the application and sensitive to the administrative policies of the different administrative domains. Experiments show that applications using Armada perform well in both low- and high-bandwidth environments, and that our approach does an exceptional job of hiding the network latency inherent in grid computing.} } @TechReport{oldfield:thesis-tr, author = {Ron Oldfield}, title = {Efficient {I/O} for Computational Grid Applications}, year = {2003}, month = {May}, number = {TR2003-459}, institution = {Dept. of Computer Science, Dartmouth College}, earlier = {oldfield:thesis}, URL = {https://digitalcommons.dartmouth.edu/dissertations/5/}, URLpdf = {ftp://ftp.cs.dartmouth.edu/TR/TR2003-459.pdf}, keywords = {parallel I/O, Grid computing, pario-bib}, abstract = {High-performance computing increasingly occurs on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual" computer. A key challenge in this environment is to provide efficient access to data distributed across remote data servers. This dissertation explores some of the issues associated with I/O for wide-area distributed computing and describes an I/O system, called Armada, with the following features: a framework to allow application and dataset providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before or after computation; an algorithm to restructure application graphs to increase parallelism and to improve network performance in a wide-area network; and a hierarchical graph-partitioning scheme that deploys components of the application graph in a way that is both beneficial to the application and sensitive to the administrative policies of the different administrative domains. 
Experiments show that applications using Armada perform well in both low- and high-bandwidth environments, and that our approach does an exceptional job of hiding the network latency inherent in grid computing.} } @Article{olson:random, author = {Thomas M. Olson}, title = {Disk Array Performance in a Random {I/O} Environment}, journal = {Computer Architecture News}, year = {1989}, month = {September}, volume = {17}, number = {5}, pages = {71--77}, keywords = {I/O benchmark, transaction processing, pario-bib}, comment = {See wolman:iobench. Used IOBENCH to compare normal disk configuration with striped disks, RAID level 1, and RAID level 5, under a random I/O workload. Multiple disks with files on different disks gave good performance (high throughput and low response time) when multiple users. Striping ensures balanced load, similar performance. RAID level 1 or level 5 ensures reliability at performance cost over striping, but still good. Especially sensitive to write/read ratio --- performance lost for large number of writes.} } @InProceedings{oyang:m2io, author = {Yen-Jen Oyang}, title = {Architecture, Operating System, and {I/O} Subsystem Design of the {$M^2$} Database Machine}, booktitle = {Proceedings of the Parallel Systems Fair at the International Parallel Processing Symposium}, year = {1993}, pages = {31--38}, keywords = {parallel I/O, multiprocessor file system, parallel database, pario-bib}, comment = {A custom multiprocessor with a shared-memory clusters networked together and to shared disks. Runs Mach. Directory-based coherence protocol for the distributed file system. Background writeback.} } @InProceedings{pahuja:dpio, author = {Neena Pahuja and Gautam M. Shroff}, title = {A Data Parallel I/O Library for Workstation Networks}, booktitle = {Proceedings of the 1995 International Conference on High Performance Computing}, year = {1995}, month = {December}, pages = {423--428}, address = {New Delhi, India}, keywords = {disk array, multimedia, parallel I/O, pario-bib} } @InProceedings{paleczny:support, author = {Michael Paleczny and Ken Kennedy and Charles Koelbel}, title = {Compiler Support for Out-of-Core Arrays on Data Parallel Machines}, booktitle = {Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation}, year = {1995}, month = {February}, pages = {110--118}, address = {McLean, VA}, URL = {http://www.cs.rice.edu/~mpal/papers/Frontiers95.ps}, keywords = {compilers, parallel I/O, out-of-core applications, pario-bib}, comment = {They are developing extensions to the FortranD compiler so that it supports I/O-related directives for out-of-core computations. The compiler then analyzes the computation, inserts the necessary I/O calls, and optimizes the I/O. They hand-compile a red-black relaxation program and an LU-factorization program. I/O was much faster than VM, particularly because they were able to make large requests rather than faulting on individual pages. Overlapping I/O and computation was also a big win. See also kennedy:sio, bordawekar:model.} } @Misc{panasas:architecture, key = {PA}, title = {Object-based Storage Architecture: Defining a new generation of storage systems built on distributed, intelligent storage devices}, year = {2003}, month = {October}, howpublished = {Panasas Inc. 
white paper, version 1.0}, note = {http://www.panasas.com/docs/}, URL = {http://www.panasas.com/docs/Object_Storage_Architecture_WP.pdf}, keywords = {object-based storage, distributed file system, parallel file system, pario-bib}, comment = {The paper describes the architecture of a proprietary object-based storage system for clusters, an extension of Garth Gibson's NASD work at CMU (see gibson:nasd-tr). Similar to Lustre (cfs:lustre, braam:lustre-arch).} } @InProceedings{panfilov:raid5, author = {Oleg A. Panfilov}, title = {Performance Analysis of {RAID-5} Disk Arrays}, booktitle = {Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences}, year = {1995}, month = {January}, volume = {I}, pages = {49--60}, keywords = {RAID, disk array, parallel I/O, pario-bib} } @Article{papadopouli:vbr-streams, author = {Maria Papadopouli and Leana Golubchik}, title = {Support of {VBR} Video Streams Under Disk Bandwidth Limitations}, journal = {ACM SIGMETRICS Performance Evaluation Review}, year = {1997}, month = {December}, volume = {25}, number = {3}, pages = {13--20}, URL = {http://doi.acm.org/10.1145/270900.270903}, keywords = {multimedia, video on demand, parallel I/O, pario-bib}, comment = {Part of a special issue on parallel and distributed I/O.} } @Article{park:2disk, author = {{Chan-Ik} Park}, title = {Efficient Placement of Parity and Data To Tolerate Two Disk Failures in Disk Array Systems}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1995}, month = {November}, volume = {6}, number = {11}, pages = {1177--1184}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, disk array, reliability, fault tolerance, pario-bib}, abstract = {In this paper, we deal with the data/parity placement problem, which is described as follows: how to place data and parity evenly across disks in order to tolerate two disk failures, given the number of disks N and the redundancy rate p, which represents the amount of disk space used to store parity information. To begin with, we transform the data/parity placement problem into the problem of constructing an N x N matrix such that the matrix will correspond to a solution to the problem. A method to construct such a matrix is proposed, and we show how our method works through several illustrative examples. It is also shown that any matrix constructed by our proposed method can be mapped into a solution to the placement problem if a certain condition holds between N and p, where N is the number of disks and p is the redundancy rate.} } @InProceedings{park:interface, author = {Yoonho Park and Ridgway Scott and Stuart Sechrest}, title = {Virtual Memory Versus File Interfaces for Large, Memory-intensive Scientific Applications}, booktitle = {Proceedings of Supercomputing '96}, year = {1996}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, note = {Also available as UH Department of Computer Science Research Report UH-CH-96-7}, URL = {http://www.hpc.uh.edu/cenju/pub/vm_revisit.ps}, keywords = {virtual memory, file interface, scientific applications, out-of-core, parallel I/O, pario-bib}, abstract = {Scientific applications often require some strategy for temporary data storage to do the largest possible simulations. The use of virtual memory for temporary data storage has received criticism because of performance problems. However, modern virtual memory systems found in recent operating systems such as Cenju-3/DE give application writers control over virtual memory policies.
We demonstrate that custom virtual memory policies can dramatically reduce virtual memory overhead and allow applications to run out-of-core efficiently. We also demonstrate that the main advantage of virtual memory, namely programming simplicity, is not lost.}, comment = {Web and CDROM only. They advocate the use of traditional demand-paged virtual memory systems in supporting out-of-core applications. They are implementing an operating system for the NEC Cenju-3/DE, a shared-nothing MIMD multiprocessor with a multistage interconnection network and disks on every node. The operating system is based on Mach, and they have extended Mach to allow user-provided [local] replacement policies. Basically, they argue that you can get good performance as long as you write your own replacement policy (even OPT is possible in certain applications), and that this is easier than (re)writing the application with explicit out-of-core file I/O calls. They measure the performance of two applications on their system, with OPT, FIFO, and a new replacement algorithm customized to one of the applications. They show that they can get much better performance with some replacement policies than with others, but despite the paper's title they do not compare with the performance of an equivalent program using file I/O.} } @TechReport{park:pario, author = {Arvin Park and K. Balasubramanian}, title = {Providing Fault Tolerance in Parallel Secondary Storage Systems}, year = {1986}, month = {November}, number = {CS-TR-057-86}, institution = {Department of Computer Science, Princeton University}, keywords = {parallel I/O, reliability, RAID, pario-bib}, comment = {They use ECC with one or more parity drives in bit-interleaved systems, and on-line regeneration of failed drives from spares. More cost-effective than mirrored disks. One of the earliest references to RAID-like concepts. Basically, they describe RAID3.} } @InProceedings{parsons:complex, author = {Ian Parsons and Jonathan Schaeffer and Duane Szafron and Ron Unrau}, title = {Using {PI/OT} to Support Complex Parallel {I/O}}, booktitle = {Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing}, year = {1998}, month = {March}, pages = {285--291}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, pario-bib} } @Article{parsons:templates, author = {Ian Parsons and Ron Unrau and Jonathan Schaeffer and Duane Szafron}, title = {{PI/OT}: Parallel {I/O} Templates}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {543--570}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel programming, parallel I/O, pario-bib}, abstract = {This paper presents a novel, top-down, high-level approach to parallelizing file I/O. Each parallel file descriptor is annotated with a high-level specification, or template, of the expected parallel behaviour. The annotations are external to and independent of the source code. At run-time, all I/O using a parallel file descriptor adheres to the semantics of the selected template. By separating the parallel I/O specifications from the code, a user can quickly change the I/O behaviour without rewriting code. Templates can be composed hierarchically to construct complex access patterns. \par Two sample parallel programs using these templates are compared against versions implemented in an existing parallel I/O system (PIOUS). 
The sample programs show that the use of parallel I/O templates is beneficial from both the performance and software engineering points of view.}, comment = {An interesting approach in which they try to separate the description of the parallelism in a file's access from the sequential programming used to access the file. Seems like a good idea. It seems to assume that the programmer was porting an existing sequential code, or prefers to write their parallel program with a sequential frame of mind, including the existing fopen/fread/fwrite stdio interface. They retain the traditional stream-of-bytes file structure. See also parsons:complex.} } @TechReport{pase:t3d-fortran, author = {Douglas M. Pase and Tom MacDonald and Andrew Meltzer}, title = {{MPP Fortran} Programming Model}, year = {1993}, month = {October 11}, institution = {Cray Research, Inc.}, URL = {ftp://ftp.cray.com/product-info/program_env/program_model.html}, keywords = {compiler, parallel language, supercomputing, parallel I/O, pario-bib}, abstract = {This report describes the MPP Fortran programming model which will be supported on the first phase MPP systems. Based on existing and proposed standards, it is a work sharing model which combines features from existing models in a way that may be both efficiently implemented and useful.}, comment = {See also oed:t3d for T3D overview. I only read the part about I/O. The only I/O support, apparently, is for each processor to open and access the file independently from all other processors.} } @InProceedings{pasquale:characterization, author = {Barbara K. Pasquale and George C. Polyzos}, title = {Dynamic {I/O} characterization of {I/O} intensive scientific applications}, booktitle = {Proceedings of Supercomputing '94}, year = {1994}, pages = {660--669}, URL = {http://www.acm.org/pubs/citations/proceedings/supercomputing/198354/p660-pasquale/}, keywords = {parallel I/O, pario-bib}, abstract = {Understanding the characteristic I/O behavior of scientific applications is an integral part of the research and development efforts for the improvement of high performance I/O systems. This study focuses on application level I/O behavior with respect to both static and dynamic characteristics. We observed the San Diego Supercomputer Center's Cray C90 workload and isolated the most I/O intensive applications. The combination of a low-level description of physical resource usage and the high-level functional composition of applications and scientific disciplines for this set reveals the major sources of I/O demand in the workload. We selected two applications from the I/O intensive set and performed a detailed analysis of their dynamic I/O behavior. These applications exhibited a high degree of regularity in their I/O activity over time and their characteristic I/O behaviors can be precisely described by one and two, respectively, recurring sequences of data accesses and computation periods.} } @InProceedings{pasquale:dynamic, author = {Barbara K. Pasquale and George C.
Polyzos}, title = {Dynamic {I/O} Characterization of {I/O} Intensive Scientific Applications}, booktitle = {Proceedings of Supercomputing '94}, year = {1994}, month = {November}, pages = {660--669}, publisher = {IEEE Computer Society Press}, address = {Washington, DC}, keywords = {scientific computing, file access patterns, I/O, pario-bib}, comment = {This paper extends some of their previous results, but the real bottom line here is that some scientific applications do a lot of I/O, the I/O is bursty, and the pattern of bursts is cyclic and regular. Seems like this cyclic nature could be a source of some optimization. Included in the parallel I/O bibliography because it is useful to that community, though they did not trace a parallel workload.} } @InProceedings{pasquale:iowork, author = {Barbara K. Pasquale and George C. Polyzos}, title = {A Static Analysis of {I/O} Characteristics of Scientific Applications in a Production Workload}, booktitle = {Proceedings of Supercomputing '93}, year = {1993}, pages = {388--397}, publisher = {IEEE Computer Society Press}, address = {Portland, OR}, keywords = {scientific computing, file access patterns, pario-bib}, comment = {Analyzed one month of accounting records from a Cray YMP8/864 in SDSC's production environment. Their base assumption is that scientific application I/O is regular and predictable, e.g., repetitive periodic bursts, with distinct phases, repeating patterns, and sequential access. The goal is to characterize a set of I/O-intensive scientific applications and evaluate regularity of resource usage. They measure volumes and rates of applications and total system. Cumulative and average usage for each distinct non-system application. Most resource usage came from the 5\% of applications that were not system applications. ``Virtual I/O rate'' is the bytes transferred per CPU second, which is IMHO only a rough measure because sometimes I/O overlaps CPU time, and sometimes does not. They picked out long-running applications with a high virtual I/O rate. Top 50 applications had 71\% of bytes transferred and 10\% of CPU time. Of those, 4.66 MB/sec min, 131 MB/sec max. Of those they picked the ones executed most often. Cluster analysis showed only 1-2 clusters. Correlation between I/O and CPU time. Included in the parallel I/O bibliography because it is useful to that community, though they did not trace a parallel workload.} } @Misc{pathforward-fs, key = {SGS}, title = {Statement of Work: {SGS} File System}, year = {2001}, month = {April}, howpublished = {ASCI PathForward Program: {DOE} National Nuclear Security Administration \& the {DOD} National Security Agency}, URL = {http://www.llnl.gov/asci/pathforward_trilab/file_system_sow.pdf}, keywords = {design, parallel file system, parallel I/O, pario-bib}, comment = {Describes the requirements and desired performance features of a parallel file system designed for the DOE ASCI computers.} } @Article{patt:iosubsystem, author = {Yale N. Patt}, title = {The {I/O} Subsystem: a Candidate for Improvement}, journal = {IEEE Computer}, year = {1994}, month = {March}, volume = {27}, number = {3}, pages = {15--16}, keywords = {I/O, file system, parallel I/O, pario-bib}, comment = {This is the intro to a special issue on I/O.} } @InCollection{patterson:binformed, author = {R. Hugo Patterson and Garth A.
Gibson and Eka Ginting and Daniel Stodolsky and Jim Zelenka}, title = {Informed Prefetching and Caching}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {16}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {224--244}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {patterson:informed}, URL = {http://www.buyya.com/superstorage/}, keywords = {caching, prefetching, file system, hints, I/O, resource management, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of patterson:informed.} } @InCollection{patterson:braid, author = {David Patterson and Garth Gibson and Randy Katz}, title = {A Case for Redundant Arrays of Inexpensive Disks ({RAID})}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {1}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {3--14}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {patterson:raid}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, RAID, reliability, cost analysis, I/O bottleneck, disk array, pario-bib}, comment = {Part of jin:io-book; reformatted version of patterson:raid.} } @InProceedings{patterson:informed, author = {R. Hugo Patterson and Garth A. Gibson and Eka Ginting and Daniel Stodolsky and Jim Zelenka}, title = {Informed prefetching and caching}, booktitle = {Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles}, year = {1995}, month = {December}, pages = {79--95}, publisher = {ACM Press}, address = {Copper Mountain, CO}, earlier = {patterson:informed-tr}, later = {patterson:binformed}, keywords = {caching, prefetching, file system, hints, I/O, resource management, parallel I/O, pario-bib}, abstract = {In this paper, we present aggressive, proactive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism, and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies cost-benefit analysis to allocate buffers where they will have the greatest impact. We have implemented informed prefetching and caching in Digital's OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks running a range of applications. Informed prefetching reduces the execution time of text search, scientific visualization, relational database queries, speech recognition, and object linking by 20-83\%. Informed caching reduces the execution time of computational physics by up to 42\% and contributes to the performance improvement of the object linker and the database. Moreover, applied to multiprogrammed, I/O-intensive workloads, informed prefetching and caching increase overall throughput.}, comment = {See patterson:informed-tr for an earlier version. Programs may give hints to the file system about what they will read in the future, and in what order. Hints are used for informed prefetching and informed caching. Most interesting thing about this paper over the earlier ones is the buffer management.
Prefetcher and demand fetcher both want buffers. LRU cache and hinted cache both could supply buffers (through replacement). Each supplies a cost for giving up buffers and benefit for getting more buffers. These are expressed in a common 'currency', in terms of their expected effect on I/O service time, and a manager takes buffers from one and gives buffers to another when the benefits outweigh the costs. All is based on a simple model, which is further simplified in their implementation within OSF/1. Performance looks good; they can keep more disks busy in a parallel file system. Furthermore, informed caching helps reduce the number of I/Os. Indeed they 'discover' the MRU replacement policy automatically.} } @TechReport{patterson:informed-tr, author = {R. Hugo Patterson and Garth A. Gibson and Eka Ginting and Daniel Stodolsky and Jim Zelenka}, title = {Informed Prefetching and Caching}, year = {1995}, number = {CMU-CS-95-134}, institution = {School of Computer Science, Carnegie Mellon University}, later = {patterson:informed}, keywords = {caching, prefetching, file system, hints, I/O, resource management, parallel I/O, pario-bib}, abstract = {The underutilization of disk parallelism and file cache buffers by traditional file systems induces I/O stall time that degrades the performance of modern microprocessor-based systems. In this paper, we present aggressive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism, and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies a cost-benefit analysis to allocate buffers where they will have the greatest impact. We implemented informed prefetching and caching in DEC's OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks. When running a range of applications including text search, 3D scientific visualization, relational database queries, speech recognition, and computational chemistry, informed prefetching reduces the execution time of four of these applications by 20\% to 87\%. Informed caching reduces the execution time of the fifth application by up to 30\%.} } @InProceedings{patterson:latency, author = {R. H. Patterson and G. A. Gibson and M. Satyanarayanan}, title = {Using Transparent Informed Prefetching to Reduce File Read Latency}, booktitle = {Proceedings of the 1992 NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1992}, month = {September}, pages = {329--342}, later = {patterson:informed}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/TIP/MSST.ps}, keywords = {parallel I/O, file prefetching, file caching, pario-bib}, comment = {This 'paper' is really an annotated set of slides.} } @InProceedings{patterson:pdis-tip, author = {R. Hugo Patterson and Garth A.
Gibson}, title = {Exposing {I/O} Concurrency with Informed Prefetching}, booktitle = {Proceedings of the Third International Conference on Parallel and Distributed Information Systems}, year = {1994}, month = {September}, pages = {7--16}, later = {patterson:informed}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/TIP/PDIS.ps}, keywords = {prefetching, parallel I/O, pario-bib}, abstract = {Informed prefetching provides a simple mechanism for I/O-intensive, cache-ineffective applications to efficiently exploit highly-parallel I/O subsystems such as disk arrays. This mechanism, dynamic disclosure of future accesses, yields substantial benefits over sequential readahead mechanisms found in current file systems for non-sequential workloads. This paper reports the performance of the Transparent Informed Prefetching system (TIP), a minimal prototype implemented in a Mach 3.0 system with up to four disks. We measured reductions by factors of up to 1.9 and 3.7 in the execution time of two example applications: multi-file text search and scientific data visualization.}, comment = {Also available in HTML format at http://www.cs.cmu.edu/Web/Groups/PDL/HTML-Papers/PDIS94/final.fm.html.} } @InProceedings{patterson:raid, author = {David Patterson and Garth Gibson and Randy Katz}, title = {A case for redundant arrays of inexpensive disks {(RAID)}}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data}, year = {1988}, month = {June}, pages = {109--116}, publisher = {ACM Press}, address = {Chicago, IL}, later = {patterson:braid}, keywords = {parallel I/O, RAID, reliability, cost analysis, I/O bottleneck, disk array, pario-bib}, comment = {They make a good case for the upcoming I/O crisis, comparing single large expensive disks (SLED) with small cheap disks. They outline five levels of RAID that give different reliabilities, costs, and performances. Block-interleaved with a single check disk (level 4) or with check blocks interspersed (level 5) seems to give the best performance for supercomputer I/O or database I/O or both. Note: the TR by the same name (UCB/CSD 87/391) is essentially identical.} } @InProceedings{patterson:raid2, author = {David Patterson and Peter Chen and Garth Gibson and Randy H. Katz}, title = {Introduction to Redundant Arrays of Inexpensive Disks {(RAID)}}, booktitle = {Proceedings of IEEE Compcon}, year = {1989}, month = {Spring}, pages = {112--117}, earlier = {patterson:raid}, keywords = {parallel I/O, RAID, reliability, cost analysis, I/O bottleneck, disk array, pario-bib}, comment = {A short version of patterson:raid, with some slight updates.} } @InProceedings{patterson:snapmirror, author = {{R. Hugo} Patterson and Stephen Manley and Mike Federwisch and Dave Hitz and Steve Kleiman and Shane Owara}, title = {{SnapMirror}: File-System-Based Asynchronous Mirroring for Disaster Recovery}, booktitle = {Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies}, year = {2002}, month = {January}, pages = {117--130}, publisher = {USENIX Association}, address = {Monterey, CA}, URL = {http://www.usenix.org/publications/library/proceedings/fast02/patterson.html}, keywords = {file systems, pario-bib}, abstract = {Computerized data has become critical to the survival of an enterprise. Companies must have a strategy for recovering their data should a disaster such as a fire destroy the primary data center.
Current mechanisms offer data managers a stark choice: rely on affordable tape but risk the loss of a full day of data and face many hours or even days to recover, or have the benefits of a fully synchronized on-line remote mirror, but pay steep costs in both write latency and network bandwidth to maintain the mirror. In this paper, we argue that asynchronous mirroring, in which batches of updates are periodically sent to the remote mirror, can let data managers find a balance between these extremes. First, by eliminating the write latency issue, asynchrony greatly reduces the performance cost of a remote mirror. Second, by storing up batches of writes, asynchronous mirroring can avoid sending deleted or overwritten data and thereby reduce network bandwidth requirements. Data managers can tune the update frequency to trade network bandwidth against the potential loss of more data. We present SnapMirror, an asynchronous mirroring technology that leverages file system snapshots to ensure the consistency of the remote mirror and optimize data transfer. We use traces of production filers to show that even updating an asynchronous mirror every 15 minutes can reduce data transferred by 30\% to 80\%. We find that exploiting file system knowledge of deletions is critical to achieving any reduction for no-overwrite file systems such as WAFL and LFS. Experiments on a running system show that using file system metadata can reduce the time to identify changed blocks from minutes to seconds compared to purely logical approaches. Finally, we show that using SnapMirror to update every 30 minutes increases the response time of a heavily loaded system only 22\%.} } @Article{patterson:tip, author = {R. Hugo Patterson and Garth A. Gibson and M. Satyanarayanan}, title = {A Status Report on Research in Transparent Informed Prefetching}, journal = {ACM Operating Systems Review}, year = {1993}, month = {April}, volume = {27}, number = {2}, pages = {21--34}, later = {patterson:informed}, URL = {http://www.cs.cmu.edu/afs/cs/project/pdl/ftp/TIP/OSRev.ps}, keywords = {file system, prefetching, operating system, pario-bib}, abstract = {This paper focuses on extending the power of caching and prefetching to reduce file read latencies by exploiting application level hints about future I/O accesses. We argue that systems that disclose high-level knowledge can transfer optimization information across module boundaries in a manner consistent with sound software engineering principles. Such Transparent Informed Prefetching (TIP) systems provide a technique for converting the high throughput of new technologies such as disk arrays and log-structured file systems into low latency for applications. Our preliminary experiments show that even without a high-throughput I/O subsystem TIP yields reduced execution time of up to 30\% for applications obtaining data from a remote file server and up to 13\% for applications obtaining data from a single local disk.
These experiments indicate that greater performance benefits will be available when TIP is integrated with low level resource management policies and highly parallel I/O subsystems such as disk arrays.}, comment = {Not much new over previous TIP papers, but does have newer numbers. See patterson:tip1. Also appears in DAGS'93 (patterson:tip2). Previously appeared as TR CMU-CS-93-1.} } @InProceedings{patterson:tip2, author = {R. Hugo Patterson and Garth A. Gibson and M. Satyanarayanan}, title = {Informed Prefetching: Converting High Throughput to Low Latency}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {41--55}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, later = {patterson:informed}, keywords = {file system, prefetching, operating system, pario-bib}, abstract = {This paper focuses on extending the power of caching and prefetching to reduce file read latencies by exploiting application level hints about future I/O accesses. We argue that systems that disclose high-level knowledge can transfer optimization information across module boundaries in a manner consistent with sound software engineering principles. Such Transparent Informed Prefetching (TIP) systems provide a technique for converting the high throughput of new technologies such as disk arrays and log-structured file systems into low latency for applications. Our preliminary experiments show that even without a high-throughput I/O sub-system TIP yields reduced execution time of up to 30\% for applications obtaining data from a remote file server and up to 13\% for applications obtaining data from a single local disk. These experiments indicate that greater performance benefits will be available when TIP is integrated with low level resource management policies and highly parallel I/O subsystems such as disk arrays.}, comment = {Invited speaker: Garth Gibson. Similar paper appeared in ACM OSR April 1993 (patterson:tip)} } @Misc{patterson:vterabytes, author = {David Patterson}, title = {Terabytes $\gg$ Teraflops (or Why Work on Processors When {I/O} is Where the Action Is?)}, year = {1993}, howpublished = {Produced by University Video Communications}, note = {Videotape}, URL = {http://www.uvc.com/videos/06Patterson.video.html}, keywords = {videotape, computer architecture, parallel I/O, pario-bib}, abstract = {RISC pioneer and UC, Berkeley Computer Science Professor David Patterson is working to develop input/output systems to match the increasingly higher performance of new processors. Here he describes the results of the RAID (Redundant Arrays of Inexpensive Disks) project, which offers much greater performance, capacity, and reliability from I/O systems. Patterson also discusses a new project, Sequoia 2000, which looks at utilizing small helical scan tapes, such as digital-audiotapes or videotapes, to offer terabytes of storage for the price of a file/server. He believes that a 1000x increase in storage, available on most Ethernets, will have a much greater impact than a 1000x increase in processing speed.}, comment = {See patterson:trends. 
58 minutes.} } @InProceedings{pawlowski:parsort, author = {Markus Pawlowski and Rudolf Bayer}, title = {Parallel Sorting of Large Data Volumes on Distributed Memory Multiprocessors}, booktitle = {Parallel Computer Architectures: Theory, Hardware, Software, Applications}, year = {1993}, series = {Lecture Notes in Computer Science}, volume = {732}, pages = {246--264}, publisher = {Springer-Verlag}, address = {Berlin}, keywords = {sorting, parallel I/O algorithm, pario-bib}, comment = {Main contribution appears to be a new sampling method for initial partition of data set. They approach it from a database point of view.} } @TechReport{pearson:sorting, author = {Matthew D. Pearson}, title = {Fast Out-of-Core Sorting on Parallel Disk Systems}, year = {1999}, month = {June}, number = {PCS-TR99-351}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {Matthew D. Pearson}, address = {Hanover, NH}, URL = {ftp://ftp.cs.dartmouth.edu/TR/TR99-351.ps.Z}, keywords = {parallel I/O, out of core, sorting, parallel algorithm, pario-bib}, abstract = {This paper discusses our implementation of Rajasekaran's (l,m)-mergesort algorithm (LMM) for sorting on parallel disks. LMM is asymptotically optimal for large problems and has the additional advantage of a low constant in its I/O complexity. Our implementation is written in C using the ViC* I/O API for parallel disk systems. We compare the performance of LMM to that of the C library function qsort on a DEC Alpha server. qsort makes a good benchmark because it is fast and performs comparatively well under demand paging. Since qsort fails when the swap disk fills up, we can only compare these algorithms on a limited range of inputs. Still, on most out-of-core problems, our implementation of LMM runs between 1.5 and 1.9 times faster than qsort, with the gap widening with increasing problem size.}, comment = {Undergraduate Honors Thesis. Advisor: Tom Cormen.} } @InProceedings{perez:allocation, author = {Jose Maria Perez and Felix Garcia and Jesus Carretero and Alejandro Calderon and Luis Miguel Sanchez}, title = {Data Allocation and Load Balancing for Heterogeneous Cluster Storage Systems}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {718--723}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190718abs.htm}, keywords = {parallel I/O, load balancing, pario-bib}, abstract = {Distributed filesystems are a typical solution in networked environments as clusters and grids. Parallel filesystems are a typical solution in order to reach high performance I/O distributed environment, but those filesystems have some limitations in heterogeneous storage systems. Usually in distributed systems, load balancing is used as a solution to improve the performance, but typically the distribution is made between peer-to-peer computational resources and from the processor point of view. In heterogeneous systems, like heterogeneous clusters of workstations, the existing solutions do not work so well. However, the utilization of those systems is more extended every day, having an extreme example in the grid environment. 
In this paper we bring attention to those aspects of heterogeneous distributed data systems presenting a parallel file system that take into account heterogeneity of storage nodes, the dynamic addition of new storage nodes, and an algorithm to group requests in heterogeneous systems.} } @InProceedings{perez:apriori, author = {M.~S. P\'erez and R.~A. Pons and F. Garc{\'\i}a and J. Carretero and M.~L. C\'ordoba}, title = {An Optimization of Apriori Algorithm through the Usage of Parallel {I/O} and Hints}, booktitle = {Rough Sets and Current Trends in Computing}, year = {2002}, month = {October}, series = {Lecture Notes in Computer Science}, number = {2475}, pages = {449--452}, publisher = {Springer-Verlag}, URL = {http://link.springer.de/link/service/series/0558/tocs/t2475.htm}, keywords = {parallel I/O, pario-bib}, abstract = {Association rules are very useful and interesting patterns in many data mining scenarios. Apriori algorithm is the best-known association rule algorithm. This algorithm interacts with a storage system in order to access input data and output the results. This paper shows how to optimize this algorithm adapting the underlying storage system to this problem through the usage of hints and parallel features.} } @TechReport{perez:clfs, author = {F. {P\'erez} and J. Carretero and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {{CLFS} Design: A Parallel File Manager for Multicomputers}, year = {1994}, number = {FIM/82.1/DATSI/94}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/datsi82.1.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {This document describes the detailed design of the CLFS, one of the components of the Cache Coherent File System (CCFS). CCFS has three main components: Client File Server (CLFS), Local File Server (LFS), Concurrent Disk System (CDS). The Client File Servers are located on each processing node, to develop file manager functions in a per node basis. The CLFS will interact with the LFSs to provide block services, naming, locking, real input/output and to manage the disk system, partitions, distributed partitions, etc. The CLFS includes a standard POSIX interface (internally parallelized) and some parallel extensions. It will be responsible of maintaining cache consistency, distributing accesses to servers, providing a file system interface to the user, etc.}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @InProceedings{perez:cooperation, author = {Mar\'{i}a S. P\'{e}rez and Alberto S\'{a}nchez and V\'{i}ctor Robles and Jos\'{e} M. Pe{\~{n}}a and Jemal Abawajy}, title = {Cooperation model of a multiagent parallel file system for clusters}, booktitle = {Proceedings of the Fourth IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2004}, month = {April}, pages = {595--601}, institution = {Univ Politecn Madrid, Fac Informat, Madrid, Spain}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, address = {Chicago, IL}, keywords = {multi-agent parallel file system, pario-bib} } @Article{perez:evaluate, author = {F. Perez and J. Carretero and L. Alonso and P. {De Miguel} and F.
Garcia}, title = {Evaluating {ParFiSys}: A high-performance parallel and distributed file system}, journal = {Journal of Systems Architecture}, year = {1997}, month = {May}, volume = {43}, number = {8}, pages = {533--542}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {We present an overview of ParFiSys, a coherent parallel file system developed at the UPM to provide I/O services to the GPMIMD machine, an MPP built within the ESPRIT project P-5404. Special emphasis is made on the results obtained during ParFiSys evaluation. They were obtained using several I/O benchmarks (PARKBENCH, IOBENCH, etc.) and several MPP platforms (T800, T9000, etc.) to cover a big spectrum of the ParFiSys features, being specifically oriented to measure throughput for scientific applications I/O patterns. ParFiSys is specially well suited to provide I/O services to scientific applications requiring high I/O bandwidth, to minimize application porting effort, and to exploit the parallelism of generic message-passing multicomputers.} } @Article{perez:gridexpand, author = {Jos\'{e} M. P\'{e}rez and F\'{e}lix Garcia and Jes\'{u}s Carretero and Alejandro Calder\'{o}n and Javier Fern\'{a}ndez}, title = {A Parallel {I/O} Middleware to Integrate Heterogeneous Storage Resources on Grids}, journal = {Lecture Notes in Computer Science}, booktitle = {1st European Across Grids Conference; February 13-14, 2003; Santiago de Compostela, SPAIN}, editor = {Rivera, FF; Bubak, M; Tato, AG; Doallo, R}, year = {2004}, month = {March}, volume = {2970}, pages = {124--131}, institution = {Univ Carlos III Madrid, Comp Architecture Grp, Dept Comp Sci, Madrid, Spain}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://grid.cesga.es/eabstracts/xpn-data-grid.pdf}, keywords = {Data Grids, Parallel I/O, data declustering, High performance I/O, pario-bib}, abstract = {The philosophy behind grid is to use idle resources to achieve a higher level of computational services (computation, storage, etc). Existing data grids solutions are based in new servers, specific APIs and protocols, however this approach is not a realistic solution for enterprises and universities, because this supposes the deployment of new data servers across the company. This paper describes a new approach to data access in computational grids. This approach is called GridExpand, a parallel I/O middleware that integrates heterogeneous data storage resources in grids. The proposed grid solution integrates available data network solutions (NFS, CIFS, WebDAV) and makes possible the access to a global grid file system. Our solution differs from others because it does not need the installation of new data servers with new protocols. Most of the data grid solutions use replication as the way to obtain high performance. Replication, however, introduce consistency problem for many collaborative applications, and sometimes requires the usage of lots of resources. To obtain high performance, we apply the parallel I/O techniques used in parallel file systems.}, comment = {A short paper describing an adaptation of the Expand parallel file system for data grids. Also see the related paper garcia:expand-design.} } @Article{perez:hints, author = {Mar\'{i}a S. 
P\'{e}rez and Albert S\'{a}nchez and V\'{\i}ctor Robles and Jos\'{e} Pe{\~{n}}a and Fernando P\'{e}rez}, title = {Optimizations based on hints in a parallel file system}, journal = {Lecture Notes in Computer Science}, booktitle = {4th International Conference on Computational Science (ICCS 2004); June 6-9, 2004; Krakow, POLAND}, editor = {Bubak, M; VanAlbada, GD; Sloot, PMA; Dongarra, JJ}, year = {2004}, month = {June}, volume = {3038}, pages = {347--354}, institution = {Univ Politecn Madrid, DATSI FI, E-28040 Madrid, Spain}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=3038&spage=347}, keywords = {parallel I/O, optimizations, caching, prefetching, hints, pario-bib}, abstract = {Existing parallel file systems provide applications a little control for optimizing I/O accesses. Most of these systems use optimization techniques transparent to the applications, limiting the performance achieved by these solutions. Furthermore, there is a big gap between the interface provided by parallel file systems and the needs of applications. In fact, most of the parallel file systems do not use intuitive I/O hints or other optimizations approaches. In this sense, applications programmers cannot take advantage of optimization techniques suitable for the application domain. This paper describes I/O optimizations techniques used in MAPFS, a multiagent I/O architecture. These techniques are configured by means of a double interface for specifying access patterns or hints that increase the performance of I/O operations. An example of this interface is shown.} } @InCollection{pfister:infiniband, author = {Gregory F. Pfister}, title = {An Introduction to the {InfiniBand} Architecture}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {42}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {617--632}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O architecture, pario-bib}, comment = {Part of jin:io-book.} } @InProceedings{philippsen:triton, author = {Michael Philippsen and Thomas M. Warschko and Walter F. Tichy and Christian G. Herter}, title = {{Project Triton:} Towards improved Programmability of Parallel Machines}, booktitle = {Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences}, year = {1993}, volume = {I}, pages = {192--201}, keywords = {parallel programming, parallel architecture, parallel I/O, pario-bib}, comment = {A language- and application-driven proposal for parallel architecture, that mixes SIMD and MIMD, high-performance networking, large memory, shared address space, and so forth. Fairly convincing arguments. One disk per node. Little mention of a file system though. Email from student Udo Boehm: ``We use in the version of Triton/1 with 256 PE's 72 Disks at the moment (the filesystem is scalable up to 256 Disks). These Disks are divided into 8 Groups with 9 Disks. In each group exists one parity disk. Our implementation of the filesystem is an parallel version of RAID Level 3 with some extensions. We use so called vector files for diskaccess. A file is always distributed over all disks of the diskarray. A vectorfile is divided in logical blocks.
A logical block exist of 72 physical blocks, each block is on one of the 72 disks and all these 72 physical blocks have the same blocknumber on each disk. A logical block has 18432 Bytes, where 16384 Bytes are for Data. The filesystem uses these logical blocks to save data. We do not use special PE's for the I/O. All PE's can be (are) used to do I/O ! There exists no central which coordinates the PE's.''} } @InProceedings{pierce:pario, author = {Paul Pierce}, title = {A Concurrent File System for a Highly Parallel Mass Storage System}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, month = {March}, pages = {155--160}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {parallel I/O, hypercube, Intel iPSC/2, multiprocessor file system, pario-bib}, comment = {Intel iPSC/2 Concurrent File System. Chose to tailor system for high performance for large files, read in large chunks. Uniform logical file system view, Unix stdio interface. Blocks scattered over all disks, but not striped. Blocksize 4K optimizes message-passing performance without using blocks that are too big. Tree-directory is stored in ONE file and managed by ONE process, so opens are bottlenecked, but that is not their emphasis. File headers, however, are scattered. The file header info contains a list of blocks. File header is managed by disk process on its I/O node. Data caching is done only at the I/O node of the originating disk drive. Read-ahead is used but not detailed here.} } @Article{pinkenburg:tpo++, author = {Simon Pinkenburg and Wolfgang Rosenstiel}, title = {Parallel {I/O} in an object-oriented message-passing library}, journal = {Lecture Notes in Computer Science}, booktitle = {11th European Parallel Virtual Machine and Message Passing Interface Users Group Meeting; September 19-22, 2004; Budapest, HUNGARY}, editor = {Kranzlmuller, D; Kacsuk, P; Dongarra, J}, year = {2004}, month = {November}, volume = {3241}, pages = {251--258}, institution = {Univ Tubingen, Dept Comp Engn, Wilhelm Schickard Inst Informat, Sand 13, D-72076 Tubingen, Germany; Univ Tubingen, Dept Comp Engn, Wilhelm Schickard Inst Informat, D-72076 Tubingen, Germany}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/link.asp?id=91qfhjbyrbgb7mhw}, keywords = {object-oriented message passing, TPO++, parallel I/O interface, pario-bib}, abstract = {The article describes the design and implementation of parallel I/O in the object-oriented message-passing library TPO++. TPO++ is implemented on top of the message passing standard MPI and provides an object-oriented, type-safe and data centric interface to message-passing. Starting with version 2, the MPI standard defines primitives for parallel I/O called MPI-IO. Based on this layer, we have implemented an object-oriented parallel I/O interface in TPO++. The project is part of our efforts to apply object-oriented methods to the development of parallel physical simulations. We give a short introduction to our message-passing library and detail its extension to parallel I/O.
Performance measurements between TPO++ and MPI are compared and discussed.} } @InProceedings{piriyakumar:enhanced, author = {Douglas Antony Louis Piriyakumar and Paul Levi and Rolf Rabenseifner}, title = {Enhanced File Interoperability with Parallel {MPI} File-{I/O} in Image Processing}, booktitle = {Recent Advances in Parallel Virtual Machine and Message Passing Interface}, year = {2002}, series = {Lecture Notes in Computer Science}, volume = {2474}, pages = {174--?}, publisher = {Springer-Verlag}, URL = {http://link.springer.de/link/service/series/0558/bibs/2474/24740174.htm}, URLpdf = {http://link.springer.de/link/service/series/0558/papers/2474/24740174.pdf}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {One of the crucial problems in image processing is Image Matching, i.e., to match two images, or in our case, to match a model with the given image. This problem being highly computation intensive, parallel processing is essential to obtain the solutions in time due to real world constraints. The Hausdorff method is used to locate human beings in images by matching the image with models and is parallelized with MPI. The images are usually stored in files with different formats. As most of the formats can be converted into ASCII file format containing integers, we have implemented 3 strategies namely, Normal File Reading, Off-line Conversion and Run-time Conversion for free format integer file reading and writing. The parallelization strategy is optimized so that I/O overheads are minimal. The relative performances with multiple processors are tabulated for all the cases and discussed. The results obtained demonstrate the efficiency of our strategies and the implementations will enhance the file interoperability which will be useful for image processing community to use parallel systems to meet the real time constraints.} } @TechReport{poole:sio-survey, author = {James T. Poole}, title = {Preliminary Survey of {I/O} Intensive Applications}, year = {1994}, number = {CCSF-38}, institution = {Scalable I/O Initiative}, address = {Caltech Concurrent Supercomputing Facilities, Caltech}, URL = {http://www.cacr.caltech.edu/SIO/pubs/SIO_apps.ps}, keywords = {parallel I/O, pario-bib, multiprocessor file system, file access pattern, checkpoint}, comment = {Goal is to collect a set of representative applications from biology, chemistry, earth science, engineering, graphics, and physics, use performance-monitoring tools to analyze them, create templates and benchmarks that represent them, and then later to evaluate the performance of new I/O tools created by rest of the SIO initiative. Seem to be four categories of I/O needs: input, output, checkpoint, and virtual memory (``out-of-core'' scratch space). Not all types are significant in all applications. (Two groups mention databases and the need to perform computationally complex queries.) Large input is typically raw data (seismic soundings, astronomical observations, satellite remote sensing, weather information). Sometimes there are real-time constraints. Output is often periodic, e.g., the state of the system every few timesteps; typically the volume would increase along with I/O capacity and bandwidth. Checkpointing is a common request; preferably allowing application to choose what and when to checkpoint, and definitely including the state of files. 
Many kinds of out-of-core: 1) temp files between passes (often written and read sequentially), 2) regular patterns like FFT, matrix transpose, solvers, and single-pass read/compute/write, 3) random access, e.g., to precomputed tables of integrals. Distinct differences in the ways people choose to divide data into files; sometimes all in one huge file, sometimes many ``small'' files (e.g., one per processor, one per timestep, one per region, etc.). Important: overlap of computation and I/O, independent access by individual processors. Not always important: ordering of records read or written by different processors, exposing the I/O model to the application writer. Units of I/O seem to be either (sub)matrices (1--5 dimensions) or items in a collection of objects (100--10000 bytes each). Data sets varied up to 1~TB; bandwidth needs varied up to 1~GB/s. See also bagrodia:sio-character, choudhary:sio-language, bershad:sio-os.} } @InProceedings{poplawski:simulation, author = {Anna L. Poplawski and David M. Nicol}, title = {An Investigation of Out-of-Core Parallel Discrete-Event Simulation}, booktitle = {Proceedings of the Winter Simulation Conference}, year = {1999}, month = {December}, pages = {524--530}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, URL = {http://www.cs.dartmouth.edu/research/DaSSF/papers/WSC99-ooc.ps}, keywords = {discrete-event simulation, parallel computing, out-of-core application, parallel I/O, pario-bib}, abstract = {In large-scale discrete-event simulations the size of a computer's physical memory limits the size of the system to be simulated. Demand paging policies that support virtual memory are generally ineffective. Use of parallel processors to execute the simulation compounds the problems, as memory can be tied down due to synchronization needs. We show that by taking more direct control of disks it is possible to break through the memory bottleneck, without significantly increasing overall execution time. We model one approach to conducting out-of-core parallel simulation, identifying relationships between execution, memory, and I/O costs that admit good performance.} } @InProceedings{poston:hpfs, author = {Alan Poston}, title = {A High Performance File System for {UNIX}}, booktitle = {Proceedings of the USENIX Workshop on UNIX and Supercomputers}, year = {1988}, pages = {215--226}, keywords = {file system, unix, parallel I/O, disk striping, pario-bib}, comment = {A new file system for Unix based on striped files. Better performance for sequential access, better for large-file random access and about the same for small-file random access. Allows full striping track prefetch, or even volume prefetch. Needs a little bit of buffer management change. Talks about buffer management and parity blocks.} } @InProceedings{prabhakar:browsing, author = {Sunil Prabhakar and Divyakant Agrawal and Amr {El Abbadi} and Ambuj Singh and Terence Smith}, title = {Browsing and Placement of Multiresolution Images on Parallel Disks}, booktitle = {Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1997}, month = {November}, pages = {102--113}, publisher = {ACM Press}, address = {San Jose, CA}, URL = {http://doi.acm.org/10.1145/266220.266230}, keywords = {multimedia, parallel I/O, pario-bib}, abstract = {With rapid advances in computer and communication technologies, there is an increasing demand to build and maintain large image repositories. 
In order to reduce the demands on I/O and network resources, multiresolution representations are being proposed for the storage organization of images. Image decomposition techniques such as {\em wavelets} can be used to provide these multiresolution images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. These visually similar coefficients can be thought of as {\em thumbnails} or {\em icons} of the original image. This paper addresses the problem of storing these multiresolution coefficients on disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Several strategies are evaluated to store the image coefficients on parallel disks. These strategies can be classified into two broad classes depending on whether the access pattern of the images is used in the placement. Disk simulation is used to evaluate the performance of these strategies. Simulation results are validated with results from experiments with real disks and are found to be in good agreement. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.}, comment = {They use simulation to study several different placement policies for the thumbnail and varying-resolution versions of images on a disk array.} } @InProceedings{pratt:twofs, author = {Terrence W. Pratt and James C. French and Phillip M. Dickens and Janet, Jr., Stanley A.}, title = {A Comparison of the Architecture and Performance of Two Parallel File Systems}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, pages = {161--166}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {parallel I/O, Intel iPSC/2, nCUBE, pario-bib}, comment = {Simple comparison of the iPSC/2 and nCUBE/10 parallel I/O systems. Short description of each system, with simple transfer rate measurements. See also french:ipsc2io-tr.} } @InProceedings{preslan:gfs, author = {Kenneth W. Preslan and Andrew P. Barry and Jonathan E. Brassow and Grant M. Erickson and Erling Nygaard and Christopher J. Sabol and Steven R. Soltis and David C. Teigland and Matthew T. O'Keefe}, title = {A 64-bit, Shared Disk File System for {Linux}}, booktitle = {Proceedings of the Seventh NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1999}, month = {March}, pages = {22--41}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://gfs.lcse.umn.edu/pubs/NASA_GFS_1999.pdf}, keywords = {Linux, shared file system, network-attached disks, disk striping, parallel I/O, pario-bib}, comment = {They discuss a shared, serverless, file system for Linux that integrates IP-based network attached storage and Fibre-Channel-based storage area networks. Based on soltis:gfs.} } @TechReport{prost:mpi-io, author = {Jean-Pierre Prost and Marc Snir and Peter Corbett and Dror Feitelson}, title = {{MPI-IO,} A Message-Passing Interface for Concurrent {I/O}}, year = {1994}, month = {August}, number = {RC~19712 (87394)}, institution = {IBM T.J.
Watson Research Center}, keywords = {parallel I/O, message-passing, multiprocessor file system interface, pario-bib}, comment = {See newer version mpi-ioc:mpi-io5.} } @Booklet{rab:raidbook, key = {RAB}, title = {The {RAIDBook}: A Source Book for {RAID} Technology}, year = {1993}, month = {June 9}, howpublished = {The RAID Advisory Board}, address = {Lino Lakes, MN}, note = {First Edition}, keywords = {RAID, disk array, parallel I/O, pario-bib}, comment = {Basically, an educational piece about the basics of RAID technology. Helps to define terms across the industry. Written by the RAID advisory board, which is an industry consortium. Overviews RAID, RAID levels, non-Berkeley RAID levels. List of Board members. Bibliography.} } @InCollection{rabenseifner:benchmark, author = {Rolf Rabenseifner and Alice E. Koniges and Jean-Pierre Prost and Richard Hedges}, title = {The Parallel Effective {I/O} Bandwidth Benchmark: b\_eff\_io}, booktitle = {Parallel I/O for Cluster Computing}, chapter = {4}, editor = {Christophe Cerin and Hai Jin}, year = {2004}, month = {February}, pages = {107--132}, publisher = {Kogan Page Ltd.}, URL = {http://www.hlrs.de/people/rabenseifner/publ/cpj_b_eff_io_nov19.pdf}, keywords = {parallel I/O benchmarks, MPI-IO, pario-bib} } @MastersThesis{rajaram:thesis, author = {Kumaran Rajaram}, title = {Principal Design Criteria influencing the Performance of a Portable, High Performance Parallel {I/O} Implementation}, year = {2002}, month = {May}, school = {Department of Computer Science, Mississippi State University}, URL = {http://library.msstate.edu/etd/show.asp?etd=etd-04052002-105711}, keywords = {MPI-IO, MPI, parallel I/O, pario-bib} } @Article{rajasekaran:out-of-core, author = {Sanguthevar Rajasekaran}, title = {Out-of-core computing on mesh connected computers}, journal = {Journal of Parallel and Distributed Computing}, year = {2004}, month = {November}, volume = {64}, number = {11}, pages = {1311--1317}, institution = {Univ Connecticut, Dept CSE, 371 Fairfield Rd, ITEB 257, Storrs, CT 06269 USA; Univ Connecticut, Dept CSE, Storrs, CT 06269 USA}, publisher = {Academic Press Inc. Elsevier Science}, copyright = {(c)2004 Elsevier Engineering Information, Inc.; The Thomson Corporation}, URL = {http://dx.doi.org/10.1016/j.jpdc.2004.08.003}, keywords = {out-of-core, sorting, parallel disk model, performance analysis, pario-bib}, abstract = {Several models of parallel disks are found in the literature. These models have been proposed to alleviate the I/O bottleneck arising in handling voluminous data. These models have the general theme of assuming multiple disks. For instance the parallel disks model assumes D disks and a single computer. It is also assumed that a block of data from each of the D disks can be fetched into the main memory in one parallel I/O operation. In this paper, we study a model where there are more than one processors and each processor has an associated disk. In addition to the I/O cost, one also has to account for the inter-processor communication costs. To begin with we study the mesh and we investigate the performance of the mesh with respect to out-of-core computing. As a case study we consider the problem of sorting. The goal of this paper is to study the properties of this model. (c) 2004 Elsevier Inc. All rights reserved.
(27 Refs.)} } @Article{rajasekaran:selection, author = {Sanguthevar Rajasekaran}, title = {Selection Algorithms for Parallel Disk Systems}, journal = {Journal of Parallel and Distributed Computing}, year = {2001}, month = {April}, volume = {61}, number = {4}, pages = {536--544}, publisher = {Academic Press}, URL = {http://www.idealibrary.com/links/doi/10.1006/jpdc.2000.1682}, keywords = {I/O algorithms, parallel I/O, pario-bib}, abstract = {With the widening gap between processor speeds and disk access speeds, the I/O bottleneck has become critical. Parallel disk systems have been introduced to alleviate this bottleneck. In this paper we present deterministic and randomized selection algorithms for parallel disk systems. The algorithms to be presented, in addition to being asymptotically optimal, have small underlying constants in their time bounds and hence have the potential of being practical.} } @MastersThesis{ramachandran:msthesis, author = {Harish Ramachandran}, title = {Design and implementation of the system interface for {PVFS2}}, year = {2002}, month = {December}, school = {Clemson University}, URL = {ftp://ftp.parl.clemson.edu/pub/techreports/2002/PARL-2002-008.ps}, keywords = {pvfs, parallel file system, system interface, pario-bib}, abstract = {As Linux clusters emerged as an alternative to traditional supercomputers one of the problems faced was the absence of a high-performance parallel file system comparable to the file systems on the commercial machines. The Parallel Virtual FileSystem(PVFS) developed at Clemson University has attempted to address this issue. PVFS is a parallel file system currently used in Parallel I/O research and as a parallel file system on Linux clusters running high-performance parallel applications. \par An important component of parallel file systems is the file system interface which has different requirements compared to the normal UNIX interface particularly the I/O interface. A parallel I/O interface is required to provide support for non-contiguous access patterns, collective I/O, large file sizes in order to achieve good performance with parallel applications. As it supports significantly different functionality, the interface exposed by a parallel file system assumes importance. So, the file system needs to either directly provide a parallel I/O interface or at the least support for such an interface to be implemented on top. \par The PVFS2 System Interface is the native file system interface for PVFS2 - the next generation of PVFS. The System Interface provides support for multiple interfaces such as a POSIX interface or a parallel I/O interface like MPI-IO to access PVFS2 while also allowing the benefits of abstraction by decoupling the System Interface from the actual file system implementation. This document discusses the design and implementation of the System Interface for PVFS2.} } @InProceedings{rauch:partitioncast, author = {Felix Rauch and Christian Kurmann and Thomas M. 
Stricker}, title = {Partition Cast--- Modelling and Optimizing the Distribution of Large Data Sets in {PC} Clusters}, booktitle = {Proceedings of the Sixth International Euro-Par Conference}, year = {2000}, month = {August}, series = {Lecture Notes in Computer Science}, volume = {1900}, pages = {1118--1131}, publisher = {Springer-Verlag}, address = {Munich}, URL = {http://www.cs.inf.ethz.ch/CoPs/report343/report343.pdf}, keywords = {multicast, network, cluster, parallel I/O, pario-bib}, abstract = {Multicasting large amounts of data efficiently to all nodes of a PC cluster is an important operation. In the form of a partition cast it can be used to replicate entire software installations by cloning. Optimizing a partition cast for a given cluster of PCs reveals some interesting architectural tradeoffs, since the fastest solution does not only depend on the network speed and topology, but remains highly sensitive to other resources like the disk speed, the memory system performance and the processing power in the participating nodes. We present an analytical model that guides an implementation towards an optimal configuration for any given PC cluster. The model is validated by measurements on our cluster using Gigabit- and Fast Ethernet links. The resulting simple software tool, Dolly, can replicate an entire 2 GByte Windows NT image onto 24 machines in less than 5 minutes.} } @PhdThesis{rauch:thesis, author = {Felix Rauch}, title = {Distribution and Storage of Data on Local and Remote Disks in Multi-Use Clusters of PCs}, year = {2003}, school = {Dept. of Computer Science, Swiss Federal Institute of Technology (ETH Zurich)}, address = {Zurich, Switzerland}, note = {Full online publication to follow}, URL = {http://www.cs.inf.ethz.ch/~rauch/diss/}, keywords = {Cluster of PCs, commodity computing, data streams, multicast, cloning, data storage, distributed file systems, distributed devices, network-attached disks, OS image distribution, pario-bib}, abstract = {Over the last few decades, the power of personal computers (PCs) has grown steadily, following the exponential growth rate predicted by Moore's law. The trend towards the commoditization of PC components (such as CPUs, memories, high-speed interconnects and disks) results in a highly attractive price/performance ratio of the systems built from those components. Following these trends, I propose to integrate the commodity IT resources of an entire company or organization into {\em multi-use clusters of commodity PCs\/}. These include compute farms, experimental clusters as well as desktop PCs in offices and labs. This thesis follows a bottom-up architectural approach and deals with hardware and system-software architecture with a tight focus on performance and efficiency. In contrast, the Grid view of providing services instead of hardware for storage and computation deals mostly with problems of capability, service and security rather than performance and modelling thereof. \par Multi-use clusters of commodity PCs have by far enough storage on their hard-disk drives for the required local operating-system (OS) installation and therefore there is a lot of excess storage in a multi-use cluster. This additional disk space on the nodes should be put to a better use for a variety of interesting applications e.g.\ for on-line analytic data processing (OLAP). The specific contributions of the thesis include solutions to four important problems of optimized resource usage in multi-use-cluster environments.
\par Analytic models of computer systems are important to understand the performance of current systems and to predict the performance of future systems early in the design stage. The thesis introduces a simple {\em analytic model of data streams in clusters\/}. The model considers the topology of data streams as well as the limitations of the edges and nodes. It also takes into account the limitations of the resources within the nodes, which are passed through by the data streams. \par Using the model, the thesis evaluates different data-casting techniques that can be used to replicate OS installations to many nodes in clusters. The different implementations based on IP multicast, star-, tree- and multi-drop--chain topologies are evaluated with the analytic model as well as with experimental measurements. As a result of the evaluation, the {\em multi-drop chain\/} is proposed as most suitable replication technique. \par When working with multi-use clusters, we noticed that maintenance of the highly replicated system software is difficult, because there are many OS installations in different versions and customisations. Since it is desirable to backup all older versions and customisations of all OS installations, I implemented several techniques to archive the large amounts of highly redundant data contained in the nodes' OS partitions. The techniques take different approaches of comparing the data, but are all OS independent and work with whole partition images. The {\em block repositories\/} that store only {\em unique\/} data blocks prove to be an efficient data storage for OS installations in multi-use clusters. \par Finally we look at the possibilities to take advantage of the excess storage on the many nodes' hard-disk drives. The thesis investigates several ways to gather data from multiple server nodes to a client node running the applications. The combined storage can be used for data-warehousing applications. While powerful multi-CPU ``killer workstations'' with redundant arrays of inexpensive disks (RAIDs) are the current workhorses for data warehousing because of their compatibility with standard databases, they are still expensive compared to multi-use clusters of commodity PCs. On the other end several researchers in databases have tried to find domain specific solutions using middleware. My thesis looks at the question whether, and to what extent, the cost-efficient multi-use clusters of commodity PCs can provide an alternative data-warehousing platform with an OS solution that is transparent enough to run a commodity database system. To answer the question about the most suitable software layer for a possible implementation, the thesis {\em compares different distributed file systems and distributed-device systems\/} against the middleware solution that uses database-internal communication for distributing partial queries. The different approaches are modelled with the analytic model and evaluated with a microbenchmark as well as the TPC-D decision-support benchmark. \par Given the existing systems and software packages it looks like the domain specific mid\-dle\-ware-approach delivers best performance, and in the area of the transparent OS-only solution, distributed devices are faster than the more complex distributed file systems and achieve similar performance to a system with local disks only.}, comment = {See also rauch:partitioncast} } @InProceedings{reddy:compiler, author = {A. L. Narasimha Reddy and P. Banerjee and D. K.
Chen}, title = {Compiler Support for Parallel {I/O} Operations}, booktitle = {Proceedings of the 1991 International Conference on Parallel Processing}, year = {1991}, pages = {II:290--II:291}, publisher = {CRC Press}, address = {St. Charles, IL}, earlier = {reddy:compiler-tr}, keywords = {parallel I/O, pario-bib, compilers}, comment = {This version is only 2 pages. reddy:compiler-tr provides the full text. They discuss three primary issues. 1) Overlapping I/O with computation: the compiler's dependency analysis is used to decide when some I/O may be moved up and performed asynchronously with other computation. 2) Parallel execution of I/O statements: {\em if} all sizes are known at compile time, the compiler can insert seeks so that processes can access the file independently. When writing in the presence of conditionals they even propose skipping by the maximum and leaving holes in the file, and they claim that this doesn't hurt (!). 3) Parallel format conversion: again, if there are fixed-width fields the compiler can have processors seek to different locations, read data independently, and do format conversion in parallel. Really all this is saying is that fixed-width fields are good for parallelism, and that compilers could take advantage of them.} } @TechReport{reddy:compiler-tr, author = {A. L. Narasimha Reddy and P. Banerjee and D. K. Chen}, title = {Compiler Support for Parallel {I/O} Operations}, year = {1991}, institution = {IBM Yorktown Heights}, note = {Also appeared in ICPP '91}, later = {reddy:compiler}, keywords = {parallel I/O, pario-bib, compilers} } @InProceedings{reddy:hyperio1, author = {A. L. Reddy and P. Banerjee and Santosh G. Abraham}, title = {{I/O} Embedding in Hypercubes}, booktitle = {Proceedings of the 1988 International Conference on Parallel Processing}, year = {1988}, volume = {1}, pages = {331--338}, publisher = {Pennsylvania State Univ. Press}, address = {St. Charles, IL}, later = {reddy:hyperio3}, keywords = {parallel I/O, hypercube, pario-bib}, comment = {Emphasis is on adjacency. It also implies (and they assume) that data is distributed well across the disks so no data needs to move beyond the neighbors of an I/O node. Still, the idea of adjacency is good since it allows for good data distribution while not requiring it, and for balancing I/O procs among procs in a good way. Also avoids messing up the hypercube regularity with (embedded) dedicated I/O nodes.} } @InProceedings{reddy:hyperio2, author = {A. L. Reddy and P. Banerjee}, title = {{I/O} issues for hypercubes}, booktitle = {ACM International Conference on Supercomputing}, year = {1989}, pages = {72--81}, later = {reddy:hyperio3}, keywords = {parallel I/O, hypercube, pario-bib}, comment = {See reddy:hyperio3 for extended version.} } @Article{reddy:hyperio3, author = {A. L. Narasimha Reddy and Prithviraj Banerjee}, title = {Design, Analysis, and Simulation of {I/O} Architectures for Hypercube Multiprocessors}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1990}, month = {April}, volume = {1}, number = {2}, pages = {140--151}, publisher = {IEEE Computer Society Press}, earlier = {reddy:hyperio1}, keywords = {parallel I/O, hypercube, pario-bib}, comment = {An overall paper restating their embedding technique from reddy:hyperio1, plus a little bit of evaluation along the lines of reddy:pario2, plus some ideas about matrix layout on the disks. 
They claim that declustering is important, since synchronized disks do not provide enough parallelism, especially in the communication across the hypercube (since the synchronized disks must hang off one node).} } @InProceedings{reddy:pario, author = {A. Reddy and P. Banerjee}, title = {An Evaluation of multiple-disk {I/O} systems}, booktitle = {Proceedings of the 1989 International Conference on Parallel Processing}, year = {1989}, pages = {I:315--322}, publisher = {Pennsylvania State Univ. Press}, address = {St. Charles, IL}, later = {reddy:pario2}, keywords = {parallel I/O, disk array, disk striping, pario-bib}, comment = {see also expanded version reddy:pario2} } @Article{reddy:pario2, author = {A. Reddy and P. Banerjee}, title = {Evaluation of multiple-disk {I/O} systems}, journal = {IEEE Transactions on Computers}, year = {1989}, month = {December}, volume = {38}, pages = {1680--1690}, publisher = {IEEE Computer Society Press}, earlier = {reddy:pario}, later = {reddy:pario3}, keywords = {parallel I/O, disk array, disk striping, pario-bib}, comment = {Compares declustered disks (sortof MIMD-like) to synchronized-interleaved (SIMD-like). Declustering needed for scalability, and is better for scientific workloads. Handles large parallelism needed for scientific workloads and for RAID-like architectures. Synchronized interleaving is better for general file system workloads due to better utilization and reduction of seek overhead.} } @Article{reddy:pario3, author = {A. L. Reddy and Prithviraj Banerjee}, title = {A Study of Parallel Disk Organizations}, journal = {Computer Architecture News}, year = {1989}, month = {September}, volume = {17}, number = {5}, pages = {40--47}, earlier = {reddy:pario2}, keywords = {parallel I/O, disk array, disk striping, pario-bib}, comment = {nothing new over expanded version reddy:pario2, little different from reddy:pario} } @InProceedings{reddy:perfectio, author = {A. L. Narasimha Reddy and Prithviraj Banerjee}, title = {A Study of {I/O} Behavior of {Perfect} Benchmarks on a Multiprocessor}, booktitle = {Proceedings of the 17th Annual International Symposium on Computer Architecture}, year = {1990}, pages = {312--321}, keywords = {parallel I/O, file access pattern, workload, multiprocessor file system, benchmark, pario-bib}, comment = {Using five applications from the Perfect benchmark suite, they studied both implicit (paging) and explicit (file) I/O activity. They found that the paging activity was relatively small and that sequential access to VM was common. All access to files was sequential, though this may be due to the programmer's belief that the file system is sequential. Buffered I/O would help to make transfers bigger and more efficient, but there wasn't enough rereferencing to make caching useful.} } @PhdThesis{reddy:thesis, author = {Narasimha {Reddy L. Annapareddy}}, title = {Parallel Input/Output Architectures for Multiprocessors}, year = {1990}, month = {August}, school = {University of Illinois at Urbana-Champaign}, note = {Available as technical report UILU-ENG-90-2235 or CRHC-90-5}, keywords = {parallel I/O, multiprocessor architecture, pario-bib}, comment = {Much of the material in this thesis has been published in other papers, i.e., reddy:io, reddy:notsame, reddy:hyperio1, reddy:hyperio2, reddy:hyperio3, reddy:pario, reddy:pario2, reddy:pario3, reddy:perfectio, reddy:mmio. He traces some ``Perfect'' benchmarks to determine paging and file access patterns. 
He simulates a variety of declustered, synchronized, and synchronized-declustered striping configurations under both ``file'' and ``scientific'' workloads to determine which is best. He proposes embeddings for I/O nodes in hypercubes, where the I/O nodes are just like regular nodes but with an additional I/O processor and disk(s). He studies the disk configurations again, when embedded in hypercubes. He proposes ways to lay out matrices (in blocked form) across disks in a hypercube. He proposes a new parity-based fault-tolerance scheme that prevents overloading during failure-mode access. And he considers compiler issues: overlapping I/O with computation, parallelizing I/O statements, and parallel format conversion.} } @Article{reed:panel, author = {Daniel A. Reed and Charles Catlett and Alok Choudhary and David Kotz and Marc Snir}, title = {Parallel {I/O}: Getting Ready for Prime Time}, journal = {IEEE Parallel and Distributed Technology}, year = {1995}, month = {Summer}, pages = {64--71}, publisher = {IEEE Computer Society Press}, copyright = {IEEE}, note = {Edited transcript of panel discussion at the 1994 International Conference on Parallel Processing}, URL = {http://computer.org/concurrency/pd1995/p2toc.htm}, URLpdf = {http://www.cs.dartmouth.edu/~dfk/papers/reed:panel.pdf}, keywords = {parallel I/O, pario-bib, dfk}, comment = {This paper summarizes the presentations made by panel members at the ICPP panel discussion on parallel I/O, and the ensuing discussion.} } @Book{reed:sio-book, title = {Scalable Input/Output: Achieving System Balance}, booktitle = {Scalable Input/Output: Achieving System Balance}, editor = {Daniel A. Reed}, year = {2003}, month = {October}, series = {Scientific and Engineering Computation}, publisher = {MIT Press}, URL = {http://mitpress.mit.edu/catalog/item/default.asp?sid=D1D9D30E-0282-4C17-95E3-4AAEFD823A1E&ttype=2&tid=10037}, keywords = {I/O characterization, checkpointing, collective I/O, parallel database, I/O optimization, pario-bib} } @Article{rettberg:monarch, author = {Randall D. Rettberg and William R. Crowther and Philip P. Carvey and Raymond S. Tomlinson}, title = {The {Monarch Parallel Processor} Hardware Design}, journal = {IEEE Computer}, year = {1990}, month = {April}, volume = {23}, number = {4}, pages = {18--30}, publisher = {IEEE Computer Society Press}, keywords = {MIMD, parallel architecture, shared memory, parallel I/O, pario-bib}, comment = {This describes the Monarch computer from BBN. It was never built. 65K processors and memory modules. 65GB RAM. Bfly-style switch in dance-hall layout. Switch is synchronous; one switch time is a {\em frame} (one microsecond, equal to 3 processor cycles) and all processors may reference memory in one frame time. Local I-cache only. Contention reduces full bandwidth by 16 percent. Full 64-bit machine. Custom VLSI. Each memory location has 8 tag bits. One allows for a location to be locked by a processor. Thus, any FetchAndOp or full/empty model can be supported. I/O is done by adding I/O processors (up to 2K in a 65K-proc machine) in the switch. They plan 200 disks, each with an I/O processor, for 65K nodes. They would spread each block over 9 disks, including one for parity (essentially RAID).} } @InProceedings{riedel:active-mining, author = {Erik Riedel and Garth Gibson and Christos Faloutsos}, title = {Active storage for large-scale data mining and multimedia.}, booktitle = {24th Annual International Conference on Very Large Data Bases (VLDB'98)}, editor = {A. Gupta and O. Shmuel and J. 
Widom}, year = {1998}, month = {August}, pages = {62--73}, publisher = {Morgan Kaufmann Publishers Inc}, copyright = {(c)2004 IEE}, address = {New York, NY}, URL = {http://www.pdl.cmu.edu/PDL-FTP/Active/VLDB98.pdf}, keywords = {active disks, active storage, application level code, database server, data mining, pario-bib}, abstract = {The increasing performance and decreasing cost of processors and memory are causing system intelligence to move into peripherals from the CPU. Storage system designers are using this trend toward "excess" compute power to perform more complex processing and optimizations inside storage devices. To date, such optimizations have been at relatively low levels of the storage protocol. At the same time, trends in storage density, mechanics, and electronics are eliminating the bottleneck in moving data off the media and putting pressure on interconnects and host processors to move data more efficiently. We propose a system called Active Disks that takes advantage of processing power on individual disk drives to run application-level code. Moving portions of an application's processing to execute directly at disk drives can dramatically reduce data traffic and take advantage of the storage parallelism already present in large systems today. We discuss several types of applications that would benefit from this capability with a focus on the areas of database, data mining, and multimedia. We develop an analytical model of the speedups possible for scan-intensive applications in an Active Disk system. We also experiment with a prototype Active Disk system using relatively low-powered processors in comparison to a database server system with a single, fast processor. Our experiments validate the intuition in our model and demonstrate speedups of 2x on 10 disks across four scan-based applications. The model promises linear speedups in disk arrays of hundreds of disks, provided the application data is large enough. (57 refs.)} } @PhdThesis{riedel:thesis, author = {Erik Riedel}, title = {Active Disks - Remote Execution for Network-Attached Storage}, year = {1999}, month = {November}, school = {Carnegie Mellon University}, address = {Pittsburgh, PA}, URL = {http://www.pdl.cs.cmu.edu/ftp/Active/riedel_thesis_abs.html}, keywords = {storage, active disks, embedded systems, architecture, databases, data mining, disk scheduling, pario-bib}, abstract = {Today's commodity disk drives, the basic unit of storage for computer systems large and small, are actually small computers, with a processor, memory, and 'network' connection, along with the spinning magnetic material that permanently stores the data. As more and more of the information in the world becomes digitally available, and more and more of our daily activities are recorded and stored, people are increasingly finding value in analyzing, rather than simply storing and forgetting, these large masses of data. Sadly, advances in I/O performance have lagged the development of commodity processor and memory technology, putting pressure on systems to deliver data fast enough for these types of data-intensive analysis. This dissertation proposes a system called Active Disks that takes advantage of the processing power on individual disk drives to run application-level code. Moving portions of an application's processing directly to the disk drives can dramatically reduce data traffic and take advantage of the parallelism already present in large storage systems. It provides a new point of leverage to overcome the I/O bottleneck. 
\par This dissertation presents the factors that will make Active Disks a reality in the not-so-distant future, the characteristics of applications that will benefit from this technology, an analysis of the improved performance and efficiency of systems built around Active Disks, and a discussion of some of the optimizations that are possible with more knowledge available directly at the devices. It also compares this work with previous work on database machines and examines the opportunities that allow us to take advantage of these promises today where previous approaches have not succeeded. The analysis is motivated by a set of applications from data mining, multimedia, and databases and is performed in the context of a prototype Active Disk system that shows dramatic speedups over a system with traditional, "dumb" disks.} } @Unpublished{riesen:experience, author = {Rolf Riesen and Arthur B. Maccabe and Stephen R. Wheat}, title = {Experience in Implementing a Parallel File System}, year = {1993}, month = {March}, note = {Available for ftp?}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, comment = {They describe their experience building a file system for SUNMOS. Paper describes tuning the SCSI device, their striping strategy, their message-passing tricks, and some performance results.} } @Article{rochberg:ctip, author = {David Rochberg and Garth Gibson}, title = {Prefetching Over a Network: Early Experience with {CTIP}}, journal = {ACM SIGMETRICS Performance Evaluation Review}, year = {1997}, month = {December}, volume = {25}, number = {3}, pages = {29--36}, URL = {http://doi.acm.org/10.1145/270900.270906}, keywords = {file prefetching, distributed file system, parallel I/O, pario-bib}, comment = {Part of a special issue on parallel and distributed I/O.} } @InProceedings{rodriguez:nnt, author = {Bernardo Rodriguez and Leslie Hart and Tom Henderson}, title = {Programming Regular Grid-Based Weather Simulation Models for Portable and Fast Execution}, booktitle = {Proceedings of the 1995 International Conference on Parallel Processing}, year = {1995}, month = {August}, pages = {III:51--59}, publisher = {CRC Press}, address = {St. Charles, IL}, keywords = {weather simulation, scientific application, parallel I/O, pario-bib}, comment = {Related to hart:grid.} } @TechReport{rosales:cds, author = {F. Rosales and J. Carretero and F. {P\'erez} and P. de~Miguel and F. {Garc\'{\i}a} and L. Alonso}, title = {{CDS} Design: A Parallel Disk Server for Multicomputers}, year = {1994}, number = {FIM/83.1/DATSI/94}, institution = {Universidad Politecnic Madrid}, address = {Madrid, Spain}, URL = {http://laurel.datsi.fi.upm.es/~gp/publications/datsi83.1.ps.Z}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {This document describes the detailed design of the CDS, one of the components of the Cache Coherent File System (CCFS). CCFS has three main components: Client File Server (CLFS), Local File Server (LFS), Concurrent Disk System (CDS). A CDSs is located on each disk node, to develop input/output functions in a per node basis. The CDS will interact with the microkernel drivers to execute real input/output and to manage the disk system. The CDS includes general services to distribute accesses to disks, controlling partition information, etc.}, comment = {See carretero:*, rosales:cds, perez:clfs.} } @InProceedings{rosti:impact, author = {Emilia Rosti and Giuseppe Serazzi and Evgenia Smirni and Mark S. 
Squillante}, title = {The Impact of {I/O} on Program Behavior and Parallel Scheduling}, booktitle = {Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems}, year = {1998}, month = {June}, pages = {56--65}, publisher = {ACM Press}, URL = {http://www.acm.org/pubs/citations/proceedings/metrics/277851/p56-rosti/}, keywords = {CPU scheduling, disk scheduling, I/O model, parallel I/O, pario-bib} } @InProceedings{rothnie:ksr, author = {James Rothnie}, title = {{Kendall Square Research:} Introduction to the {KSR1}}, booktitle = {Proceedings of the 1992 DAGS/PC Symposium}, year = {1992}, month = {June 23--27}, pages = {200--210}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, keywords = {parallel architecture, shared memory, MIMD, interconnection network, parallel I/O, memory-mapped files, pario-bib}, comment = {Overview of the KSR1.} } @InProceedings{roy:unixfile, author = {Paul J. Roy}, title = {Unix File Access and Caching in a Multicomputer Environment}, booktitle = {Proceedings of the Usenix {Mach III} Symposium}, year = {1993}, pages = {21--37}, keywords = {multiprocessor file system, Unix, Mach, memory mapped file, pario-bib}, comment = {Describes the modifications to the OSF/1 AD file system for a multicomputer environment. Goal is for normal Unix files, not supercomputer access. The big thing was separation of the caching from backing store management, by pulling out the cache management into the Extended Memory Management (XMM) subsystem. Normally OSF/1 maps files to Mach memory objects, which are then accessed (through read() and write()) using bcopy(). XMM makes it possible to access these memory objects from any node in the system, providing coherent compute-node caching of pages from the memory object. It uses tokens controlled by the XMM server at the file's server node to support a single-reader, single-writer policy on the whole file, but migrating page by page. They plan to extend to multiple writers, but atomicity constraints on the file pointer and metadata make it difficult. Files are NOT striped across file servers or I/O nodes. Several hacks were necessary to work around Mach interface problems. Unix buffer caching is abandoned. Future includes supercomputer support in the form of turning off all caching. No performance evaluation included. See zajcew:osf1.} } @Unpublished{rullman:interface, author = {Brad Rullman and David Payne}, title = {An Efficient File {I/O} Interface for Parallel Applications}, year = {1995}, month = {February}, note = {DRAFT presented at the Workshop on Scalable I/O, Frontiers~'95}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {They believe that the API should be Unix-compatible, systems must support scalable performance on large transfers of data, and that systems must support very large files. Most of the paper is specifics about the Paragon PFS interface, which has many features not mentioned in earlier PFS papers. 
Contact brad@ssd.intel.com or payne@ssd.intel.com.} } @Misc{ryan:cfs, author = {Steve Ryan}, title = {{CFS} workload demonstration code}, year = {1991}, month = {July}, howpublished = {WWW ftp://ftp.cs.dartmouth.edu/pub/pario/examples/CFS3D.tar.Z}, note = {A simple program demonstrating CFS usage for ARC3D-like applications}, URL = {ftp://ftp.cs.dartmouth.edu/pub/pario/examples/CFS3D.tar.Z}, keywords = {parallel I/O workload, file access pattern, Intel, pario-bib}, comment = {A sample code that tries to behave like a parallel ARC3D in terms of its output. It writes two files, one containing three three-dimensional matrices X, Y, and Z, and the other containing the four-dimensional matrix Q. The matrices are spread over all the nodes, and each file is written in parallel by the processors. See also ryan:navier.} } @InProceedings{ryan:navier, author = {J. S. Ryan and S. K. Weeratunga}, title = {Parallel Computation of {3-D Navier-Stokes} Flowfields for Supersonic Vehicles}, booktitle = {31st Aerospace Sciences Meeting and Exhibit}, year = {1993}, address = {Reno, NV}, note = {AIAA Paper 93-0064}, keywords = {parallel application, CFD, parallel I/O, pario-bib}, comment = {This paper goes with the ryan:cfs code example. Describes their parallel implementation of the ARC3D code on the iPSC/860. A section of the paper considers I/O, which is to write out a large multidimensional matrix at each timestep. They found that it was actually faster to write to separate files, because congestion in the I/O nodes was hurting performance. They never got more than 2 MB/s, even so, on a system that should obtain 7-10 MB/s peak.} } @InProceedings{salem:diskstripe, author = {Kenneth Salem and Hector Garcia-Molina}, title = {Disk Striping}, booktitle = {Proceedings of the IEEE 1986 Conference on Data Engineering}, year = {1986}, pages = {336--342}, earlier = {salem:striping}, keywords = {parallel I/O, disk striping, disk array, pario-bib}, comment = {See the techreport salem:striping for a nearly identical but more detailed version.} } @TechReport{salem:striping, author = {Kenneth Salem and Hector Garcia-Molina}, title = {Disk Striping}, year = {1984}, month = {December}, number = {332}, institution = {EECS Dept., Princeton Univ.}, later = {salem:diskstripe}, keywords = {parallel I/O, disk striping, disk array, pario-bib}, comment = {Cite salem:diskstripe instead. Basic paper on striping. For uniprocessor, single-user machine. Interleaving asynchronous, even without matching disk locations, though this is discussed. All done with models.} } @InProceedings{salmon:cubix, author = {John Salmon}, title = {{CUBIX: Programming} Hypercubes without Programming Hosts}, booktitle = {Proceedings of the Second Conference on Hypercube Multiprocessors}, year = {1986}, pages = {3--9}, keywords = {hypercube, multiprocessor file system interface, pario-bib}, comment = {Previously, hypercubes were programmed as a combination of host and node programs. Salmon proposes to use a universal host program that acts essentially as a file server, responding to requests from the node programs. Two modes: crystalline, where node programs run in loose synchrony, and amorphous, where node programs are asynchronous. In the crystalline case, files have a single file pointer and are either single- or multiple-access; single access means all nodes must simultaneously issue the same request; multiple access means they all simultaneously issue the same request with different parameters, giving an interleaved pattern.
Amorphous allows asynchronous activity, with separate file pointers per node.} } @InProceedings{salmon:nbody, author = {John Salmon and Michael Warren}, title = {Parallel Out-of-core Methods for {N}-body Simulation}, booktitle = {Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing}, year = {1997}, month = {March}, URL = {http://www.cacr.caltech.edu/~johns/pubs/siam97/}, keywords = {parallel I/O, out of core applications, scientific computing, pario-bib}, abstract = {Hierarchical treecodes have, to a large extent, converted the compute-bound N-body problem into a memory-bound problem. The large ratio of DRAM to disk pricing suggests use of out-of-core techniques to overcome memory capacity limitations. We will describe a parallel, out-of-core treecode library, targeted at machines with independent secondary storage associated with each processor. Borrowing the space-filling curve techniques from our in-core library, and ``manually'' paging, results in excellent spatial and temporal locality and very good performance.}, comment = {Only published on CD-ROM} } @InProceedings{sanders:async, author = {Peter Sanders}, title = {Asynchronous scheduling of redundant disk arrays}, booktitle = {Twelfth ACM Symposium on Parallel Algorithms and Architectures}, year = {2000}, month = {July}, pages = {89--98}, publisher = {ACM Press}, copyright = {(c)2004 IEE}, address = {Bar Harbor, ME, USA}, later = {sanders:jasync}, keywords = {parallel disks, lazy scheduling, random redundant storage, I/O algorithm, random block placement, bipartite matching, pario-bib}, abstract = {Random redundant allocation of data to parallel disk arrays can be exploited to achieve low access delays. New algorithms are proposed which improve the previously known shortest queue algorithm by systematically exploiting that scheduling decisions can be deferred until a block access is actually started on a disk. These algorithms are also generalized for coding schemes with low redundancy. Using extensive experiments, practically important quantities are measured which have so far eluded an analytical treatment: the delay distribution when a stream of requests approaches the limit of the system capacity, the system efficiency for parallel disk applications with bounded prefetching buffers, and the combination of both for mixed traffic. A further step towards practice is taken by outlining the system design for $\alpha$: automatically load-balanced parallel hard-disk array. (31 refs.)}, comment = {Also see later version sanders:jasync.} } @TechReport{sanders:datatypes, author = {Darren Sanders and Yoonho Park and Maciej Brodowicz}, title = {Implementation and performance of {MPI-IO} file access using {MPI} datatypes}, year = {1996}, month = {November}, number = {UH-CS-96-12}, institution = {University of Houston}, URL = {http://www.hpc.uh.edu/cenju/pub/mpio.ps}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {In this paper we document our experience implementing MPI-IO file access using MPI datatypes. We present performance results and discuss two significant problems that stem from the flexibility of MPI datatypes. First, MPI datatypes can be used to specify non-contiguous access patterns. Optimizing data transfers for such patterns is difficult.
Second, the behavior of MPI datatypes in a heterogeneous environment is not well-defined.}, comment = {They devise several file-access strategies for different situations, depending on the particulars of the etypes and filetypes in use: sequential, two-phase I/O, one file access per etype (random access), and one file access per etype element (random access with smaller pieces). They measure the performance of their system with example patterns that trigger each strategy. It would be nice to see a more extensive performance analysis of their implementation, and of their strategies.} } @Article{sanders:jasync, author = {Peter Sanders}, title = {Asynchronous Scheduling of Redundant Disk Arrays}, journal = {IEEE Transactions on Computers}, year = {2003}, month = {September}, volume = {52}, number = {9}, pages = {1170--1184}, institution = {Max Planck Inst Comp Sci, Stuhlsatzenhausweg 85, D-66123 Saarbrucken, Germany}, publisher = {IEEE Computer Society Press}, URL = {http://csdl.computer.org/dl/trans/tc/2003/09/t1170.pdf}, keywords = {parallel disks, lazy scheduling, random redundant storage, I/O algorithm, random block placement, bipartite matching, pario-bib}, abstract = {Allocation of data to a parallel disk using redundant storage and random placement of blocks can be exploited to achieve low access delays. New algorithms are proposed which improve the previously known shortest queue algorithm by systematically exploiting the fact that scheduling decisions can be deferred until a block access is actually started on a disk. These algorithms are also generalized for coding schemes with low redundancy. Using extensive simulations, practically important quantities are measured which have so far eluded an analytical treatment: The delay distribution when a stream of requests approaches the limit of the system capacity, the system efficiency for parallel disk applications with bounded prefetching buffers, and the combination of both for mixed traffic. A further step toward practice is taken by outlining the system design for $\alpha$: automatically load-balanced parallel hard-disk array. Additional algorithmic measures are proposed for $\alpha$ that allow variable-sized blocks, seek time reduction, fault tolerance, inhomogeneous systems, and flexible prioritization schemes. (41 refs.)} } @Article{sanders:models, author = {P. Sanders}, title = {Reconciling simplicity and realism in parallel disk models}, journal = {Parallel Computing}, year = {2002}, month = {May}, volume = {28}, number = {5}, pages = {705--723}, publisher = {Elsevier Science}, URL = {http://www.elsevier.com/gej-ng/10/35/21/60/57/29/abstract.html}, keywords = {parallel I/O, pario-bib}, abstract = {For the design and analysis of algorithms that process huge data sets, a machine model is needed that handles parallel disks. There seems to be a dilemma between simple and flexible use of such a model and accurate modeling of details of the hardware. This paper explains how many aspects of this problem can be resolved. The programming model implements one large logical disk allowing concurrent access to arbitrary sets of variable size blocks. This model can be implemented efficiently on multiple independent disks even if zones with different speed, communication bottlenecks and failed disks are allowed. These results not only provide useful algorithmic tools but also imply a theoretical justification for studying external memory algorithms using simple abstract models.
The algorithmic approach is random redundant placement of data and optimal scheduling of accesses. The analysis generalizes a previous analysis for simple abstract external memory models in several ways (higher efficiency, variable block sizes, more detailed disk model).} } @InProceedings{savage:afraid, author = {Stefan Savage and John Wilkes}, title = {{AFRAID}--- A Frequently Redundant Array of Independent Disks}, booktitle = {Proceedings of the 1996 USENIX Technical Conference}, year = {1996}, month = {January}, pages = {27--39}, URL = {http://www.hpl.hp.com/personal/John_Wilkes/papers/AFRAID.ps.Z}, keywords = {RAID, disk array, parallel I/O, pario-bib}, comment = {RAID array that relaxes the consistency requirements, to not write parity during busy periods, then to go back and update parity during idle periods. Thus you sacrifice a little reliability for performance; you can select how much.} } @TechReport{scheuermann:partition, author = {Peter Scheuermann and Gerhard Weikum and Peter Zabback}, title = {Data Partitioning and Load Balancing in Parallel Disk Systems}, year = {1994}, month = {January}, number = {209}, institution = {ETH Zurich}, later = {scheuermann:partition2}, keywords = {parallel I/O, disk array, disk striping, load balance, pario-bib}, comment = {Updated as scheuermann:partition2. They describe a file system that attempts to choose both the degree of declustering and the striping unit size to accommodate the needs of different files. They also describe static and dynamic placement and migration policies to readjust the load across disks. Note that there are several references in the bib that are about their file system, called FIVE. Seems to be the same as scheuermann:tunable.} } @Article{scheuermann:partition2, author = {Peter Scheuermann and Gerhard Weikum and Peter Zabback}, title = {Data Partitioning and Load Balancing in Parallel Disk Systems}, journal = {The VLDB Journal}, year = {1998}, month = {February}, volume = {7}, number = {1}, pages = {48--66}, earlier = {scheuermann:partition}, URL = {http://link.springer.de/link/service/journals/00778/papers/8007001/80070048.ps.gz}, keywords = {parallel I/O, disk array, disk striping, load balance, pario-bib}, abstract = {Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper, we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent, self-reliant file system that aims to optimize striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but effective heuristics that incur only little overhead.
We present performance experiments based on synthetic workloads and real-life traces.}, comment = {Updated version of scheuermann:partition.} } @Unpublished{scheuermann:tunable, author = {Peter Scheuermann and Gerhard Weikum and Peter Zabback}, title = {The Case for Tunable Disk Arrays}, year = {1993}, note = {Publication status unknown.}, keywords = {parallel I/O, disk array, disk striping, pario-bib}, comment = {Seems to be the same as scheuermann:partition.} } @InCollection{schikuta:bookchap, author = {Erich Schikuta and Heinz Stockinger}, title = {Parallel {I/O} for Clusters: Methodologies and Systems}, booktitle = {High Performance Cluster Computing}, editor = {Rajkumar Buyya}, year = {1999}, pages = {439--462}, publisher = {Prentice Hall PTR}, keywords = {parallel file system, cluster computing, parallel I/O, pario-bib} } @Article{schikuta:pario, author = {Erich Schikuta and Helmut Wanek}, title = {Parallel {I/O}}, journal = {International Journal of High Performance Computing Applications}, year = {2001}, month = {Summer}, volume = {15}, number = {2}, pages = {162--168}, publisher = {Sage Publications}, keywords = {parallel I/O, pario-bib}, comment = {A brief overview of issues in parallel I/O, and a short case study of the data-intensive computational grid at CERN.} } @InProceedings{schloss:hcsa, author = {Gary Schloss and Michael Vernick}, title = {{HCSA:} A Hybrid Client-Server Architecture}, booktitle = {Proceedings of the IPPS~'95 Workshop on Input/Output in Parallel and Distributed Systems}, year = {1995}, month = {April}, pages = {63--77}, later = {schloss:hcsa-book}, URL = {http://www.cs.sunysb.edu/~vernick/Papers/iopads.ps}, keywords = {parallel I/O, pario-bib}, comment = {In the context of client-server database systems, they propose to make a compromise between shared-disk architectures, where the disks are all attached to the network and all machines are both clients and servers, and a system where the disks are attached to a single server. Their compromise attaches the disks to both the network and the server.} } @InCollection{schloss:hcsa-book, author = {Gerhard A. Schloss and Michael Vernick}, title = {{HCSA}: A Hybrid Client-Server Architecture}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {15}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {333--351}, publisher = {Kluwer Academic Publishers}, earlier = {schloss:hcsa}, keywords = {parallel I/O architecture, pario-bib}, abstract = {The {\em HCSA} (Hybrid Client-Server Architecture), a flexible system layout that combines the advantages of the traditional Client-Server Architecture (CSA) with those of the Shared Disk Architecture (SDA), is introduced. In {\em HCSA}, the traditional CSA-style I/O subsystem is modified to give the clients network access to both the server and the server's set of disks. Hence, the {\em HCSA} is more fault-tolerant than the CSA since there are two paths between any client and the shared data. Moreover, a simulation study demonstrates that the {\em HCSA} is able to support a larger number of clients than the CSA or SDA under similar system workloads.
Finally, the {\em HCSA} can run applications in either a CSA mode, an SDA mode, or a combination of the two, thus offering backward compatibility with a large number of existing applications.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @Misc{schneider:sp2-io, author = {David Schneider}, title = {Application {I/O} and Related Issues on the {SP2}}, year = {1995}, organization = {Cornell Theory Center, Cornell University}, note = {Available at \verb+http://www.tc.cornell.edu/SmartNodes/Newsletters/1994/V6N5/application.html+}, URL = {http://www.tc.cornell.edu/SmartNodes/Newsletters/1994/V6N5/application.html}, keywords = {parallel I/O, IBM SP-2, pario-bib} } @InProceedings{schulz:semantic, author = {Martin Schulz and Daniel A. Reed}, title = {Using Semantic Information to Guide Efficient I/O on Clusters}, booktitle = {Proceedings of the Eleventh IEEE International Symposium on High Performance Distributed Computing}, year = {2002}, pages = {135--142}, publisher = {IEEE Computer Society Press}, address = {Edinburgh, Scotland}, keywords = {I/O, data distribution, medical imaging application, parallel I/O, pario-bib}, comment = {The paper describes DIOM (Distributed I/O management), a system to manage data distributed to local disks of a cluster of workstations. The distribution process uses semantic information from both the data set and the application to decide how to distribute the data. The data is stored using a self-describing format (similar to HDF). The description of the data is either stored in a file header, or it is part of a central repository (format identified by file suffix). DIOM decides how to distribute the data based on the application-supplied splitting-pattern , of which there are three types: single (copy all data to a single node), block (divide data evenly between the nodes), round (stripe blocks in a round-robin fashion). Parameters such as stripe size, initial node, etc, are defined by the app.} } @TechReport{schulze:raid, author = {Martin Schulze}, title = {Considerations in the Design of a {RAID} Prototype}, year = {1988}, month = {August}, number = {UCB/CSD 88/448}, institution = {UC Berkeley}, URL = {http://cs-tr.cs.berkeley.edu/TR/UCB:CSD-88-448}, keywords = {parallel I/O, RAID, disk array, disk architecture, pario-bib}, comment = {Very practical description of the RAID I prototype.} } @InProceedings{schulze:raid2, author = {Martin Schulze and Garth Gibson and Randy Katz and David Patterson}, title = {How Reliable is a {RAID}?}, booktitle = {Proceedings of IEEE Compcon}, year = {1989}, month = {Spring}, earlier = {chen:raid}, keywords = {parallel I/O, reliability, RAID, disk array, disk architecture, pario-bib}, comment = {Published version of second paper in chen:raid. Some overlap with schulze:raid, though that paper has more detail.} } @InProceedings{schwabe:flexible, author = {Eric J. Schwabe and Ian M. Sutherland}, title = {Flexible Use of Parity Storage Space in Disk Arrays}, booktitle = {Proceedings of the Eighth Symposium on Parallel Algorithms and Architectures}, year = {1996}, month = {June}, pages = {99--108}, publisher = {ACM Press}, address = {Padua, Italy}, keywords = {parallel disks, disk array, parity, RAID, pario-bib} } @Article{schwabe:jlayouts, author = {Eric J. Schwabe and Ian M. Sutherland and Bruce K. 
Holmer}, title = {Evaluating Approximately Balanced Parity-Declustered Data Layouts for Disk Arrays}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {501--523}, publisher = {North-Holland (Elsevier Scientific)}, earlier = {schwabe:layouts}, keywords = {disk array, parity, RAID, parallel I/O, pario-bib}, abstract = {Parity-declustered data layouts were developed to reduce the time for on-line failure recovery in disk arrays. They generally require perfect balancing of reconstruction workload among the disks; this restrictive balance condition makes such data layouts difficult to construct. In this paper, we consider approximately balanced data layouts, where some variation in the reconstruction workload over the disks is permitted. Such layouts are considerably easier to construct than perfectly balanced layouts. We consider three methods for constructing approximately balanced data layouts, and analyze their performance both theoretically and experimentally. We conclude that on uniform workloads, approximately balanced layouts have performance nearly identical to that of perfectly balanced layouts.} } @InProceedings{schwabe:layouts, author = {Eric J. Schwabe and Ian M. Sutherland and Bruce K. Holmer}, title = {Evaluating Approximately Balanced Parity-Declustered Data Layouts for Disk Arrays}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {41--54}, publisher = {ACM Press}, address = {Philadelphia}, later = {schwabe:jlayouts}, keywords = {parallel I/O, disk array, parity, RAID, pario-bib}, abstract = {Parity declustering has been used to reduce the time required to reconstruct a failed disk in a disk array. Most existing work on parity declustering uses BIBD-based data layouts, which distribute the workload of reconstructing a failed disk over the remaining disks of the array with perfect balance. For certain array sizes, however, there is no known BIBD-based layout. In this paper, we evaluate data layouts that are approximately balanced --- that is, that distribute the reconstruction workload over the disks of the array with only approximate balance. Approximately balanced layouts are considerably easier to construct than perfectly balanced layouts. We consider three methods for generating approximately balanced layouts: randomization, simulated annealing, and perturbing a BIBD-based layout whose size is near the desired size. We compare the performance of these approximately balanced layouts with that of perfectly balanced layouts using a disk array simulator. We conclude that, on uniform workloads, approximately balanced data layouts have performance nearly identical to that of perfectly balanced layouts. Approximately balanced layouts therefore provide the reconstruction performance benefits of perfectly balanced layouts for arrays where perfectly balanced layouts are either not known, or do not exist.} } @InProceedings{scott:matrix, author = {David S. Scott}, title = {Parallel {I/O} and Solving Out of Core Systems of Linear Equations}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {123--130}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, keywords = {parallel I/O, scientific computing, matrix factorization, Intel, pario-bib}, abstract = {Large systems of linear equations arise in a number of scientific and engineering applications. 
In this paper we describe the implementation of a family of disk based linear equation solvers and the required characteristics of the I/O system needed to support them.}, comment = {Invited speaker. See also scott:solvers. This gives a very brief overview of Intel's block solver and slab solver, both out-of-core linear-systems solvers. He notes a few optimizations that had to be made to CFS to make it work: data and metadata needed to have equal priority in the cache, because often the (higher-priority) metadata was crowding out the data; and they had to restrict some files to small subsets of disks to reduce the contention for the cache at each I/O node caused by large groups of processors all requesting at the same time (see nitzberg:cfs for the same problem).} } @InProceedings{scott:solvers, author = {David S. Scott}, title = {Out of Core Dense Solvers on {Intel} Parallel Supercomputers}, booktitle = {Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation}, year = {1992}, pages = {484--487}, keywords = {parallel I/O, scientific computing, Intel, pario-bib}, comment = {He discusses ProSolver-DES, which factors large matrices by swapping square submatrices in and out of memory, and Intel's new solver, which swaps column blocks in and out. The new solver is a little slower, but allows full pivoting, which is needed for stability in some matrices. A short paper with little detail. Some performance numbers. See scott:matrix.} } @InProceedings{seamons:compressed, author = {K. E. Seamons and M. Winslett}, title = {A Data Management Approach for Handling Large Compressed Arrays in High Performance Computing}, booktitle = {Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation}, year = {1995}, month = {February}, pages = {119--128}, URL = {http://bunny.cs.uiuc.edu/CADR/pubs/compression.ps}, keywords = {parallel I/O, pario-bib}, comment = {``This paper shows how compression can be used to speed up parallel i/o of large arrays. The current version of the paper focuses on improving write performance.'' They use chunked files like in seamons:interface but before writing they compress each chunk on its compute node, and after reading they decompress each chunk on its compute node. Presumably this is only useful when you plan to read back whole chunks. They find better performance for compressing in many cases, even when the compression time dominates the I/O time, because it reduces the I/O time so much. They found that the compression time and compression ratio can vary widely from chunk to chunk, leading to a tremendous load imbalance that unfortunately spoils some of the advantages if all compute nodes must wait for the slowest to finish.} } @InProceedings{seamons:interface, author = {K. E. Seamons and M. Winslett}, title = {An Efficient Abstract Interface for Multidimensional Array {I/O}}, booktitle = {Proceedings of Supercomputing '94}, year = {1994}, month = {November}, pages = {650--659}, publisher = {IEEE Computer Society Press}, address = {Washington, DC}, later = {seamons:jpanda}, URL = {http://bunny.cs.uiuc.edu/CADR/pubs/super94.ps}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {``This paper shows what large performance gains can be made for parallel i/o of large arrays by using a carefully implemented library interface for i/o that makes use of array chunking.
For example, the authors obtained a factor of 10 speedup in output of time step data by using the natural array chunks of the problem decomposition as the units of i/o on an Intel iPSC/860. The paper also presents results from experiments with the use of chunking in checkpointing and restarts on parallel architectures, and the use of chunking with memory-mapped data files in visualization on sequential architectures.'' They describe a library that supports chunked representations of matrices. That is, ways to checkpoint, output, or input multidimensional matrices to files in a blocked rather than row-major or column-major layout. This helps the file be more versatile for reading in a variety of dimensions. Their experiments show good performance improvements, although they only tried it for an application whose data set in memory was already in a blocked distribution -- I would guess that smaller improvements might come from column- or row-oriented memory distributions. Also, some of their performance improvement came from characteristics specific to the Intel CFS file system, having to do with its IOP-cache management policies. See also seamons:schemas and seamons:compressed.} } @Article{seamons:jpanda, author = {Kent E. Seamons and Marianne Winslett}, title = {Multidimensional Array {I/O} in {Panda~1.0}}, journal = {Journal of Supercomputing}, year = {1996}, volume = {10}, number = {2}, pages = {191--211}, earlier = {seamons:interface}, keywords = {parallel I/O, collective I/O, pario-bib} } @Misc{seamons:msio, author = {K. E. Seamons and Y. Chen and M. Winslett and Y. Cho and S. Kuo and P. Jones and J. Jozwiak and M. Subramanian}, title = {Fast and Easy {I/O} for Arrays in Large-scale Applications}, booktitle = {Workshop on Modeling and Specification of I/O}, year = {1995}, month = {October}, note = {At SPDP'95}, URL = {http://bunny.cs.uiuc.edu/CDR/pubs/msio95.html}, keywords = {parallel I/O, scientific computing, pario-bib}, abstract = {This four-page paper, written for an audience from the supercomputing/parallel i/o community, is a nice succinct introduction to Panda. Abstract and summary: \par Scientists with high-performance computing needs are plagued by applications suffering poor i/o performance and are burdened with the need to consider low-level physical storage details of persistent arrays in order to reach acceptable i/o performance levels, especially with existing parallel i/o facilities. The Panda i/o library (URL http://bunny.cs.uiuc.edu/CADR/panda.html) serves as a concrete example of a methodology for freeing application developers from unnecessary storage details through high-level abstract interfaces and providing them with increased performance and greater portability. \par Panda addresses these problems by introducing high-level application program interfaces for array i/o on both parallel and sequential machines, and by developing an efficient commodity-parts-based implementation of those interfaces across a variety of computer architectures. It is costly to build a file system from scratch and we designed Panda to run on top of existing commodity file systems such as AIX; excellent performance using this approach implies immediate and broad applicability. High-level interfaces provide ease of use, application portability, and, most importantly, allow plenty of flexibility for an efficient underlying implementation. 
A high-level view of an entire i/o operation, made possible with Panda's high level interfaces, allows Panda to optimize reading and writing arrays to the host file system on the i/o nodes using Panda's server-directed i/o architecture. \par Panda focuses specifically on multidimensional arrays, the data type at the root of i/o performance problems in scientific computing. The Panda i/o library exhibits excellent performance on the NASA Ames NAS IBM SP2, attaining 83--98\% of peak AIX performance on each i/o node in the experiments described in this paper. We expect high-level interfaces such as Panda's to become the interfaces of choice for scientific applications in the future. As Panda can be easily added on top of existing parallel file systems and ordinary file systems without changing them, Panda illustrates a way to obtain cheap, fast, and easy-to-use i/o for high-performance scientific applications.}, comment = {Just a short 4-page summary of the Panda I/O library, including some brief performance results.} } @InProceedings{seamons:panda, author = {K. E. Seamons and Y. Chen and P. Jones and J. Jozwiak and M. Winslett}, title = {Server-Directed Collective {I/O} in {Panda}}, booktitle = {Proceedings of Supercomputing '95}, year = {1995}, month = {December}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://www.supercomp.org/sc95/proceedings/520_SEAM/SC95.HTM}, keywords = {collective I/O, parallel I/O, pario-bib}, abstract = {We present the architecture and implementation results for Panda 2.0, a library for input and output of multidimensional arrays on parallel and sequential platforms. Panda achieves remarkable performance levels on the IBM SP2, showing excellent scalability as data size increases and as the number of nodes increases, and provides throughputs close to the full capacity of the AIX file system on the SP2 we used. We argue that this good performance can be traced to Panda's use of server-directed i/o (a logical-level version of disk-directed i/o [Kotz94b]) to perform array i/o using sequential disk reads and writes, a very high level interface for collective i/o requests, and built-in facilities for arbitrary rearrangements of arrays during i/o. Other advantages of Panda's approach are ease of use, easy application portability, and a reliance on commodity system software.}, comment = {This rewrite of Panda (see seamons:interface) is in C++ and runs on the SP2. They provide simple ways to declare the distribution of your array in memory and on disk, to form a list of arrays to be output at each timestep or at each checkpoint, and then to call for a timestep or checkpoint. Then they use something like disk-directed I/O (kotz:jdiskdir) internally to accomplish the rearrangement and transfer of data from compute nodes to I/O nodes. Note proceedings only on CD-ROM and WWW.} } @InProceedings{seamons:schemas, author = {K. E. Seamons and M. 
Winslett}, title = {Physical Schemas for Large Multidimensional Arrays in Scientific Computing Applications}, booktitle = {Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management}, year = {1994}, month = {September}, pages = {218--227}, URL = {http://bunny.cs.uiuc.edu/CADR/pubs/ssdbm.ps}, keywords = {parallel I/O, scientific database, scientific computing, pario-bib}, comment = {``This paper presents PANDA's high-level interfaces for i/o operations, including checkpoint, restart, and time step output, and explains the rationale behind them.'' Basically they provide a bit of detail for the file formats they use in seamons:interface} } @PhdThesis{seamons:thesis, author = {Kent E. Seamons}, title = {Panda: Fast Access to Persistent Arrays Using High Level Interfaces and Server Directed Input/Output}, year = {1996}, month = {May}, school = {University of Illinois at Urbana-Champaign}, URL = {http://bunny.cs.uiuc.edu/CADR/pubs/seamons-thesis.html}, keywords = {parallel I/O, persistent data, parallel computing, pario-bib}, abstract = {Multidimensional arrays are a fundamental data type in scientific computing and are used extensively across a broad range of applications. Often these arrays are persistent, i.e., they outlive the invocation of the program that created them. Portability and performance with respect to input and output (i/o) pose significant challenges to applications accessing large persistent arrays, especially in distributed-memory environments. A significant number of scientific applications perform conceptually simple array i/o operations, such as reading or writing a subarray, an entire array, or a list of arrays. However, the algorithms to perform these operations efficiently on a given platform may be complex and non-portable, and may require costly customizations to operating system software. \par This thesis presents a high-level interface for array i/o and three implementation architectures, embodied in the Panda (Persistence AND Arrays) array i/o library. The high-level interface contributes to application portability, by encapsulating unnecessary details and being easy to use. Performance results using Panda demonstrate that an i/o system can provide application programs with a high-level, portable, easy-to-use interface for array i/o without sacrificing performance or requiring custom system software; in fact, combining all these benefits may only be possible through a high-level interface due to the great freedom and flexibility a high-level interface provides for the underlying implementation. \par The Panda server-directed i/o architecture is a prime example of an efficient implementation of collective array i/o for closely synchronized applications in distributed-memory single-program multiple-data (SPMD) environments. A high-level interface is instrumental to the good performance of server-directed i/o, since it provides a global view of an upcoming collective i/o operation that Panda uses to plan sequential reads and writes. 
Performance results show that with server-directed i/o, Panda achieves throughputs close to the maximum AIX file system throughput on the i/o nodes of the IBM SP2 when reading and writing large multidimensional arrays.}, comment = {See also chen:panda, seamons:panda, seamons:compressed, seamons:interface, seamons:schemas, seamons:msio, seamons:jpanda} } @InProceedings{segawa:pvfs-pm, author = {Koji Segawa and Osamu Tatebe and Yuetsu Kodama and Tomohiro Kudoh and Toshiyuki Shimizu}, title = {Design and implementation of {PVFS-PM}: a cluster file system on {SC}ore}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {705--711}, organization = {National Institute of Advanced Industrial Science and Technology (AIST)}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190705abs.htm}, keywords = {parallel I/O, pario-bib}, abstract = {This paper discusses the design and implementation of a cluster file system, called PVFS-PM, on the SCore cluster system software. This is the first attempt to implement a cluster file system on the SCore system. It is based on the PVFS cluster file system but replaces TCP with the PMv2 communication library supported by SCore to provide a scalable, high-performance cluster file system. PVFS-PM improves the performance by factors of 1.07 and 1.93 for writing and reading, respectively, with 8 I/O nodes, compared with the original PVFS on TCP on a Gigabit Ethernet-connected SCore cluster.} } @InProceedings{shah:algorithms, author = {Rahul Shah and Peter J. Varman and Jeffrey Scott Vitter}, title = {Online algorithms for prefetching and caching on parallel disks}, journal = {Annual ACM Symposium on Parallel Algorithms and Architectures}, booktitle = {Proceedings of the Sixteenth Symposium on Parallel Algorithms and Architectures}, year = {2004}, month = {June}, volume = {16}, pages = {255--264}, copyright = {(c)2004 Elsevier Engineering Information, Inc.}, howpublished = {SPAA 2004 - Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures; 2004; v.16; p.255-264}, address = {Barcelona, Spain}, URL = {http://doi.acm.org/10.1145/1007912.1007950}, keywords = {online algorithms, prefetching, caching, parallel disk model, threshold LRU, pario-bib}, abstract = {Parallel disks provide a cost-effective way of speeding up I/Os in applications that work with large amounts of data. The main challenge is to achieve as much parallelism as possible, using prefetching to avoid bottlenecks in disk access. Efficient algorithms have been developed for some particular patterns of accessing the disk blocks. In this paper, we consider general request sequences. When the request sequence consists of unique block requests, the problem is called prefetching and is a well-solved problem for arbitrary request sequences. When the reference sequence can have repeated references to the same block, we need to devise an effective caching policy as well. While optimum offline algorithms have been recently designed for the problem, in the online case, no effective algorithm was previously known.
Our main contribution is a deterministic online algorithm threshold-LRU which achieves an $O((MD/L)^{2/3})$ competitive ratio and a randomized online algorithm threshold-MARK which achieves an $O(\sqrt{MD/L} \log(MD/L))$ competitive ratio for the caching/prefetching problem on the parallel disk model (PDM), where $D$ is the number of disks, $M$ is the size of the fast memory buffer, and $M + L$ is the amount of lookahead available in the request sequence. The best-known lower bound on the competitive ratio is $\Omega(\sqrt{MD/L})$ for lookahead $L \geq M$ in both models. We also show that if the deterministic online algorithm is allowed to have twice the memory of the offline algorithm, then a tight competitive ratio of $\Theta(\sqrt{MD/L})$ can be achieved. This problem generalizes the well-known paging problem on a single disk to the parallel disk model.} } @Article{shen:data-management, author = {X. H. Shen and W. K. Liao and A. Choudhary and G. Memik and M. Kandemir}, title = {A high-performance application data environment for large-scale scientific computations}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2003}, month = {December}, volume = {14}, number = {12}, pages = {1262--1274}, keywords = {data management, scientific applications, workflow, parallel file systems, pario-bib}, abstract = {Effective high-level data management is becoming an important issue with more and more scientific applications manipulating huge amounts of secondary-storage and tertiary-storage data using parallel processors. A major problem facing the current solutions to this data management problem is that these solutions either require a deep understanding of specific data storage architectures and file layouts to obtain the best performance (as in high-performance storage management systems and parallel file systems), or they sacrifice significant performance in exchange for ease-of-use and portability (as in traditional database management systems). We discuss the design, implementation, and evaluation of a novel application development environment for scientific computations. This environment includes a number of components that make it easy for the programmers to code and run their applications without much programming effort and, at the same time, to harness the available computational and storage power on parallel architectures. (39 refs.)} } @Article{shen:dpfs, author = {Xiaohui H. Shen and Alok Choudhary}, title = {A high-performance distributed parallel file system for data-intensive computations}, journal = {Journal of Parallel and Distributed Computing}, year = {2004}, month = {September}, volume = {64}, number = {10}, pages = {1157--1167}, institution = {Northwestern Univ, Dept Elect \& Comp Engn, Ctr Parallel \& Distributed Comp, Evanston, IL 60208 USA}, publisher = {ACADEMIC PRESS INC ELSEVIER SCIENCE}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://dx.doi.org/10.1016/j.jpdc.2004.07.001}, keywords = {distributed file system, parallel file system, striping, pario-bib}, abstract = {One of the challenges brought by large-scale scientific applications is how to avoid remote storage access by collectively using sufficient local storage resources to hold huge amounts of data generated by the simulation while providing high-performance I/O. DPFS, a distributed parallel file system, is designed and implemented to address this problem.
DPFS collects locally distributed and unused storage resources as a supplement to the internal storage of parallel computing systems to satisfy the storage capacity requirement of large-scale applications. In addition, like parallel file systems, DPFS provides striping mechanisms that divide a file into small pieces and distributes them across multiple storage devices for parallel data access. The unique feature of DPFS is that it provides three file levels with each file level corresponding to a file striping method. In addition to the traditional linear striping method, DPFS also provides a novel Multidimensional striping method that can solve performance problems of linear striping for many popular access patterns. Other issues such as load-balancing and user interface are also addressed in DPFS. (C) 2004 Elsevier Inc. All rights reserved.} } @Article{shi:dma-raid, author = {Zhan Shi and Jiangling Zhang and Xinrong Zhou}, title = {Using {DMA} aligned buffer to improve software RAID performance}, journal = {Lecture Notes in Computer Science}, booktitle = {4th International Conference on Computational Science (ICCS 2004); June 6-9, 2004; Krakow, POLAND}, editor = {M. Bubak and G. D. van Albada and P. M. A. Sloot and J. J. Dongarra}, year = {2004}, month = {June}, volume = {3038}, pages = {355--362}, institution = {Huazhong Univ Sci \& Technol, Dept Comp Sci, Natl Storage Lab, Wuhan 430074, Peoples R China; Abo Akad Univ, Turku Ctr Comp Sci, FIN-20520 Turku, Finland}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=3038&spage=355}, keywords = {DMA, software RAID, performance, DMA aligned buffer, DAB, pario-bib}, abstract = {While the storage market grows rapidly, software RAID, as a low-cost solution, becomes more and more important nowadays. However, the performance of software RAID is greatly constrained by its implementation. Various methods have been tried to improve its performance. By integrating a novel buffer mechanism, the DMA aligned buffer (DAB), into the software RAID kernel driver, we achieved a significant performance improvement, especially on small I/O requests.} } @Article{shieh:dsm-pario, author = {Ce-Kuen Shieh and Su-Cheong Mac and Jyh-Chang Ueng}, title = {Improving the performance of distributed shared memory systems via parallel file input/output}, journal = {Journal of Systems and Software}, year = {1998}, month = {December}, volume = {44}, number = {1}, pages = {3--15}, URL = {http://dx.doi.org/10.1016/S0164-1212(98)10039-0}, keywords = {distributed shared memory, parallel I/O, file I/O, file system, virtual memory, pario-bib}, comment = {A parallel-I/O scheme for a system using DSM, which has one disk per node. The file is initially placed on node 0. The application runs once, and the system then collects information about the access pattern. The file is redistributed across all disks. Application must do all file accesses from node 0, but in subsequent runs this causes the block to be read from its disk into the local memory of the attached node, and VM-mapped into the correct place. Later page faults will move the data to the node needing the data first (if the redistribution is done well, that's the same node, so no movement is needed). At the end of the program, output data are written to the output file, on the local disk.
Thus: input files go to node 0 on the first run, then are redistributed before second run, and output files are created across all nodes but are written only at file close and only to the closest disk. Limitations: files must be wholly read during application initialization, from node 0. Files must be wholly written out during the application completion. Files are immutable. You must have one slow run initially. Input files must fit on one disk. I read sections 1-2, then skimmed the rest.} } @InProceedings{shin:hartsio, author = {Kang G. Shin and Greg Dykema}, title = {A Distributed {I/O} Architecture for {HARTS}}, booktitle = {Proceedings of the 17th Annual International Symposium on Computer Architecture}, year = {1990}, pages = {332--342}, keywords = {parallel I/O, multiprocessor architecture, MIMD, fault tolerance, pario-bib}, comment = {HARTS is a multicomputer connected with a wrapped hexagonal mesh, with an emphasis on real-time and fault tolerance. The mesh consists of network routing chips. Hanging off each is a small bus-based multiprocessor ``node''. They consider how to integrate I/O devices into this architecture: attach device controllers to processors, to network routers, to node busses, or via a separate network. They decided to compromise and hang each I/O controller off three network routers, in the triangles of the hexagonal mesh. This keeps the traffic off of the node busses, and allows multiple paths to each controller. They discuss the reachability and hop count in the presence of failed nodes and links.} } @InProceedings{shirriff:sawmill, author = {Ken Shirriff and John Ousterhout}, title = {Sawmill: A High-Bandwidth Logging File System}, booktitle = {Proceedings of the 1994 Summer USENIX Technical Conference}, year = {1994}, pages = {125--136}, keywords = {file system, parallel I/O, pario-bib, RAID}, comment = {This is a file system based on LFS and run on the RAID-II prototype (see drapeau:raid-ii). It uses the RAID-II controller's memory (32 MB) to pipeline data transfers from the RAID disks directly to (from) the network. Thus, data never flows through the server CPU or memory. The server remains in control, telling the controller where each block goes, etc. They get very high data rates. And despite being much faster than the RAID for small writes, they were still CPU-limited, because the CPU had to handle all the little requests.} } @Article{shock:database, author = {Carter T. Shock and Chialin Chang and Bongki Moon and Anurag Acharya and Larry Davis and Joel Saltz and Alan Sussman}, title = {The design and evaluation of a high-performance earth science database}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {65--89}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00117-8}, keywords = {parallel I/O, database, pario-bib}, comment = {Part of a special issue.} } @TechReport{shriver:api-tr, author = {Elizabeth A.~M. Shriver and Leonard F. Wisniewski}, title = {An {API} for Choreographing Data Accesses}, year = {1995}, month = {November}, number = {PCS-TR95-267}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {the authors}, URL = {ftp://ftp.cs.dartmouth.edu/TR/TR95-267.ps.Z}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {Current APIs for multiprocessor multi-disk file systems are not easy to use in developing out-of-core algorithms that choreograph parallel data accesses. 
Consequently, the efficiency of these algorithms is hard to achieve in practice. We address this deficiency by specifying an API that includes data-access primitives for data choreography. With our API, the programmer can easily access specific blocks from each disk in a single operation, thereby fully utilizing the parallelism of the underlying storage system. Our API supports the development of libraries of commonly-used higher-level routines such as matrix-matrix addition, matrix-matrix multiplication, and BMMC (bit-matrix-multiply/complement) permutations. We illustrate our API in implementations of these three high-level routines to demonstrate how easy it is to use.}, comment = {Also published as Courant Institute Tech Report 708.} } @InCollection{shriver:models-algs, author = {Elizabeth Shriver and Mark Nodine}, title = {An Introduction to Parallel {I/O} Models and Algorithms}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {2}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {31--68}, publisher = {Kluwer Academic Publishers}, keywords = {parallel I/O algorithms, out-of-core, pario-bib}, abstract = {Problems whose data are too large to fit into main memory are called {\it out-of-core} problems. Out-of-core parallel-I/O algorithms can handle much larger problems than in-memory variants and have much better performance than single-device variants. However, they are not commonly used---partly because the understanding of them is not widespread. Yet such algorithms ought to be growing in importance because they address the needs of users with ever-growing problem sizes and ever-increasing performance needs. \par This paper addresses this lack of understanding by presenting an introduction to the data-transfer models on which most of the out-of-core parallel-I/O algorithms are based, with particular emphasis on the Parallel Disk Model. Sample algorithms are discussed to demonstrate the paradigms (algorithmic techniques) used with these models. \par Our aim is to provide insight into both the paradigms and the particular algorithms described, thereby also providing a background for understanding a range of related solutions. It is hoped that this background would enable the appropriate selection of existing algorithms and the development of new ones for current and future out-of-core problems.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{si-woong:cluster, author = {Jang Si-Woong and Chung Ki-Dong and Sam Coleman}, title = {Design and Implementation of a Network-Wide Concurrent File System in a Workstation Cluster}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {239--245}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/conferen/mss95/woong/woong.htm}, keywords = {mass storage, cluster computing, distributed file system, parallel I/O, pario-bib}, abstract = {We estimate the performance of a network-wide concurrent file system implemented using conventional disks as disk arrays. Tests were carried out on both single system and network-wide environments. On single systems, a file was split across several disks to test the performance of file I/O operations. 
We concluded that performance was proportional to the number of disks, up to four, on a system with high computing power. Performance of a system with low computing power, however, did not increase, even with more than two disks. When we split a file across disks in a network-wide system called the Network-wide Concurrent File System (N-CFS), we found performance similar to or slightly higher than that of disk arrays on single systems. Since file access through N-CFS is transparent, this system enables traditional disks on single and networked systems to be used as disk arrays for I/O intensive jobs.} } @Article{sicola:storageworks, author = {Stephen J. Sicola}, title = {The Architecture and Design of {HS}-series {StorageWorks} Array Controllers}, journal = {Digital Technical Journal}, year = {1994}, month = {Fall}, volume = {6}, number = {4}, pages = {5--25}, keywords = {disk controller, RAID, parallel I/O, pario-bib}, comment = {Describes the RAID controller for the DEC StorageWorks product.} } @Article{simitci:patterns, author = {Huseyin Simitci and Daniel Reed}, title = {A Comparison of Logical and Physical Parallel {I/O} Patterns}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Fall}, volume = {12}, number = {3}, pages = {364--380}, keywords = {parallel I/O application, pario-bib}, abstract = {Although there are several extant studies of parallel scientific application request patterns, there is little experimental data on the correlation of physical input/output patterns with application input/output stimuli. To understand these correlations, we have instrumented the SCSI device drivers of the Intel Paragon OSF/1 operating system to record key physical input/output activities and have correlated this data with the input/output patterns of scientific applications captured via the Pablo analysis toolkit. Our analysis shows that disk hardware features profoundly affect the distribution of request delays and that current parallel file systems respond to parallel application input/output patterns in non-scalable ways.}, comment = {In a Special Issue on I/O in Parallel Applications, volume 12, numbers 3 and 4.} } @InProceedings{simitci:striping, author = {Huseyin Simitci and Daniel A. Reed}, title = {Adaptive Disk Striping for Parallel Input/Output}, booktitle = {Proceedings of the Seventh NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1999}, month = {March}, pages = {88--102}, publisher = {IEEE Computer Society Press}, address = {San Diego, CA}, URL = {http://www-pablo.cs.uiuc.edu/Publications/Papers/Goddard99.ps}, keywords = {adaptive striping, disk striping, parallel I/O, pario-bib} } @InProceedings{sinclair:instability, author = {James B. Sinclair and Jay Tang and Peter J. Varman}, title = {Instability in Parallel {I/O} Systems}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, month = {April}, pages = {16--35}, organization = {Rice University}, note = {Also appeared in Computer Architecture News 22(4)}, later = {sinclair:instability-book}, keywords = {parallel I/O, pario-bib}, comment = {They study the performance of a parallel I/O system when several concurrent processes are accessing a shared set of disks, using a common buffer pool. They found that under certain circumstances the system can become unstable, in that some subset of processes monopolize all of the resources, bringing the others to a virtual halt. 
They use analytical models to show that instability can occur if every process has distinct input and output disks, reads are faster than writes, the disk scheduling policy is of a certain class, and processes don't wait for other resources.} } @InCollection{sinclair:instability-book, author = {J.~B. Sinclair and J. Tang and P.~J. Varman}, title = {Placement-Related Problems in Shared Disk {I/O}}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {12}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {271--289}, publisher = {Kluwer Academic Publishers}, earlier = {sinclair:instability}, keywords = {parallel I/O, pario-bib}, abstract = {In a shared-disk parallel I/O system, several processes may be accessing the disks concurrently. An important example is concurrent external merging arising in database management systems with multiple independent sort queries. Such a system may exhibit instability, with one of the processes racing ahead of the others and monopolizing I/O resources. This race can lead to serialization of the processes and poor disk utilization, even when the static load on the disks is balanced. The phenomenon can be avoided by proper layout of data on the disks, as well as through other I/O management strategies. This has implications for both data placement in multiple disk systems and task partitioning for parallel processing.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{sinclair:placement, author = {J. B. Sinclair and J. Tang and P. J. Varman and B. R. Iyer}, title = {Impact of Data Placement on Parallel {I/O} Systems}, booktitle = {Proceedings of the 1993 International Conference on Parallel Processing}, year = {1993}, pages = {III--276--279}, publisher = {CRC Press}, address = {St. Charles, IL}, keywords = {parallel I/O, pario-bib}, comment = {Several external merges (many sorted runs into one) are concurrently in action. Where do you put their input and output runs, that is, on which disks? Only input runs are striped, and usually on a subset of disks.} } @TechReport{singh:adopt, author = {Tarvinder Pal Singh and Alok Choudhary}, title = {{ADOPT}: A Dynamic scheme for Optimal PrefeTching in Parallel File Systems}, year = {1994}, month = {June}, institution = {NPAC}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/ADOPT.ps.Z}, keywords = {parallel I/O, pario-bib}, comment = {They describe a prefetching scheme where hints can be provided from the programmer, compiler, or runtime library to the I/O node. These hints seem to take the form of a sequence (all in order) or a set (only one of many, from conditional expressions). The hints come from each process, not collectively. Then, the I/O node keeps these specifications and uses them to drive prefetching when there is no other work to do. They rotate among the specifications of many processes. Later they hope to examine more complex scheduling strategies and buffer-space allocation strategies.} } @InProceedings{sivathanu:dgraid, author = {Muthian Sivathanu and Vijayan Prabhakaran and Andrea C. Arpaci-Dusseau and Remzi H. 
Arpaci-Dusseau}, title = {Improving Storage System Availability with {D-GRAID}}, booktitle = {Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies}, year = {2004}, month = {March}, pages = {15--30}, organization = {University of Wisconsin, Madison}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast04/tech/sivathanu.html}, keywords = {fault tolerance, disk failure, RAID, D-GRAID, pario-bib}, abstract = {We present the design, implementation, and evaluation of D-GRAID, a gracefully-degrading and quickly-recovering RAID storage array. D-GRAID ensures that most files within the file system remain available even when an unexpectedly high number of faults occur. D-GRAID also recovers from failures quickly, restoring only live file system data to a hot spare. Both graceful degradation and live-block recovery are implemented in a prototype SCSI-based storage system underneath unmodified file systems, demonstrating that powerful "file-system like" functionality can be implemented behind a narrow block-based interface.}, comment = {Awarded best student paper.} } @InCollection{smirni:bevolutionary, author = {Evgenia Smirni and Ruth A. Aydt and Andrew A. Chien and Daniel A. Reed}, title = {{I/O} Requirements of Scientific Applications: An Evolutionary View}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {40}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {576--594}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {smirni:evolutionary}, URL = {http://www.buyya.com/superstorage/}, keywords = {I/O, workload characterization, scientific computing, parallel I/O, pario-bib}, comment = {Part of jin:io-book, modified from smirni:evolutionary.} } @InProceedings{smirni:evolutionary, author = {Evgenia Smirni and Ruth A. Aydt and Andrew A. Chien and Daniel A. Reed}, title = {{I/O} Requirements of Scientific Applications: An Evolutionary View}, booktitle = {Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing}, year = {1996}, pages = {49--59}, publisher = {IEEE Computer Society Press}, address = {Syracuse, NY}, later = {smirni:bevolutionary}, keywords = {I/O, workload characterization, scientific computing, parallel I/O, pario-bib}, abstract = {The modest I/O configurations and file system limitations of many current high-performance systems preclude solution of problems with large I/O needs. I/O hardware and file system parallelism is the key to achieving high performance. We analyze the I/O behavior of several versions of two scientific applications on the Intel Paragon XP/S. The versions involve incremental application code enhancements across multiple releases of the operating system. Studying the evolution of I/O access patterns underscores the interplay between application access patterns and file system features. Our results show that both small and large request sizes are common, that at present, application developers must manually aggregate small requests to obtain high disk transfer rates, that concurrent file accesses are frequent, and that appropriate matching of the application access pattern and the file system access mode can significantly increase application I/O performance. 
Based on these results, we describe a set of file system design principles.}, comment = {They study two applications over several versions, using Pablo to capture the I/O activity. They thus watch as application developers improve the applications' use of I/O modes and request sizes. Both applications move through three phases: initialization, computation (with out-of-core I/O or checkpointing I/O), and output. They found it necessary to tune the I/O request sizes to match the parameters of the I/O system. In the initial versions, the code used small read and write requests, which were (according to the developers) the "easiest and most natural implementation for their I/O." They restructured the I/O to make bigger requests, which better matched the capabilities of Intel PFS. They conclude that asynchronous and collective operations are imperative. They would like to see a file system that can adapt dynamically to adjust its policies to the apparent access patterns. Automatic request aggregation of some kind seems like a good idea; of course, that is one feature of a buffer cache.} } @Article{smirni:lessons, author = {E. Smirni and D.A. Reed}, title = {Lessons from characterizing the input/output behavior of parallel scientific applications}, journal = {Performance Evaluation: An International Journal}, year = {1998}, month = {June}, volume = {33}, number = {1}, pages = {27--44}, publisher = {Elsevier Science}, earlier = {smirni:workload}, URL = {http://dx.doi.org/10.1016/S0166-5316(98)00009-1}, keywords = {workload characterization, parallel I/O, scientific applications, pario-bib}, abstract = {As both processor and interprocessor communication hardware is evolving rapidly with only moderate improvements to file system performance in parallel systems, it is becoming increasingly difficult to provide sufficient input/output (I/O) performance to parallel applications. I/O hardware and file system parallelism are the key to bridging this performance gap. Prerequisite to the development of efficient parallel file systems is the detailed characterization of the I/O demands of parallel applications. In the paper, we present a comparative study of parallel I/O access patterns, commonly found in I/O intensive scientific applications. The Pablo performance analysis tool and its I/O extensions is a valuable resource in capturing and analyzing the I/O access attributes and their interactions with extant parallel I/O systems. This analysis is instrumental in guiding the development of new application programming interfaces (APIs) for parallel file systems and effective file system policies that respond to complex application I/O requirements.}, comment = {This paper compares the I/O performance of five scientific applications from the scalable I/O initiative (SIO) suite of applications. Their goals are to collect detailed performance data on application characteristics and access patterns and to use that information to design and evaluate parallel file system policies and parallel file system APIs. The related work section gives a nice overview of recent I/O characterization studies. They use the Pablo \cite{reed:pablo} performance analysis environment to analyze the performance of their five applications. 
The applications they chose to evaluate include: MESSKIT and NWChem, two implementations of the Hartree-Fock method for computational chemistry applications; QCRD, a quantum chemical reaction dynamics application; PRISM, a parallel 3D numerical simulation of the Navier-Stokes equations that models high-speed turbulent flow that is periodic in one direction; ECAT, a parallel implementation of the Schwinger multichannel method used to calculate low-energy electron molecule collisions. \par The results showed that applications use a combination of both sequential and interleaved access patterns, which shows a clear need for a more complex API than the standard UNIX API provides. In addition, when applications required concurrent accesses, they commonly channeled all I/O requests through a single node. Some form of collective I/O would have helped in these cases. They also observed that, despite the existence of several parallel I/O APIs, programmers of scientific applications preferred to use standard UNIX I/O. This is mostly due to the lack of an established portable standard. Their study was "instrumental in the design and implementation of MPI-IO". \par Their section on emerging I/O APIs is particularly interesting. They comment that "the diversity of I/O request sizes and patterns suggests that achieving high performance is unlikely with a single file system policy." Their solution is to have a file system in which the user can give "hints" to the file system expressing expected access patterns or to have a file system that automatically classifies access patterns. The file system can then choose policies to deal with the access patterns.} } @InProceedings{smirni:workload, author = {E. Smirni and D.A. Reed}, title = {Workload characterization of input/output intensive parallel applications}, booktitle = {Proceedings of the Conference on Modelling Techniques and Tools for Computer Performance Evaluation}, year = {1997}, month = {June}, series = {Lecture Notes in Computer Science}, volume = {1245}, pages = {169--180}, publisher = {Springer-Verlag}, later = {smirni:lessons}, URL = {http://vibes.cs.uiuc.edu/Publications/Papers/Tools97.ps.gz}, keywords = {parallel I/O, pario-bib}, abstract = {The broadening disparity in the performance of input/output (I/O) devices and the performance of processors and communication links on parallel systems is a major obstacle to achieving high performance for a wide range of parallel applications. I/O hardware and file system parallelism are the keys to bridging this performance gap. A prerequisite to the development of efficient parallel file systems is detailed characterization of the I/O demands of parallel applications. In this paper, we present a comparative study of the I/O access patterns commonly found in I/O intensive parallel applications. Using the Pablo performance analysis environment and its I/O extensions we captured application I/O access patterns and analyzed their interactions with current parallel I/O systems. 
This analysis has proven instrumental in guiding the development of new application programming interfaces (APIs) for parallel file systems and in developing effective file system policies that can adaptively respond to complex application I/O requirements.}, comment = {see smirni:lessons} } @Article{smotherman:taxonomy, author = {Mark Smotherman}, title = {A Sequencing-based Taxonomy of {I/O} Systems and Review of Historical Machines}, journal = {Computer Architecture News}, year = {1989}, month = {September}, volume = {17}, number = {5}, pages = {5--15}, keywords = {I/O architecture, historical summary, pario-bib}, comment = {Classifies I/O systems by how they initiate and terminate I/O. Covers both uniprocessor and multiprocessor systems.} } @Misc{snir:hpfio, author = {Marc Snir}, title = {Proposal for {IO}}, year = {1992}, month = {August 31,}, howpublished = {Posted to HPFF I/O Forum}, note = {Second Draft}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, comment = {An outline of two possible ways to specify mappings of arrays to storage nodes in a multiprocessor, and to make unformatted parallel transfers of multiple records. Seems to apply only to arrays, and to files that hold only arrays. It keeps the linear structure of files as sequences of records, but in some cases does not preserve the order of data items or of fields within subrecords. Tricky to understand unless you know HPF and Fortran 90.} } @InProceedings{sobti:personalraid, author = {Sumeet Sobti and Nitin Garg and Chi Zhang and Xiang Yu and Arvind Krishnamurthy and Randolph Y. Wang}, title = {{PersonalRAID}: Mobile Storage for Distributed and Disconnected Computers}, booktitle = {Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies}, year = {2002}, month = {January}, pages = {159--174}, publisher = {USENIX Association}, address = {Monterey, CA}, URL = {http://www.usenix.org/publications/library/proceedings/fast02/sobti.html}, keywords = {file systems, pario-bib}, abstract = {This paper presents the design and implementation of a mobile storage system called a PersonalRAID. PersonalRAID manages a number of disconnected storage devices. At the heart of a PersonalRAID system is a mobile storage device that transparently propagates data to ensure eventual consistency. Using this mobile device, a PersonalRAID provides the abstraction of a single coherent storage name space that is available everywhere, and it ensures reliability by maintaining data redundancy on a number of storage devices. One central aspect of the PersonalRAID design is that the entire storage system consists solely of a collection of storage logs; the log-structured design not only provides an efficient means for update propagation, but also allows efficient direct I/O accesses to the logs without incurring unnecessary log replay delays. The PersonalRAID prototype demonstrates that the system provides the desired transparency and reliability functionalities without imposing any serious performance penalty on a mobile storage user.} } @InProceedings{soloviev:prefetching, author = {Valery V. 
Soloviev}, title = {Prefetching in Segmented Disk Cache for Multi-Disk Systems}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {69--82}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, prefetching, disk cache, disk array, pario-bib}, abstract = {This paper investigates the performance of a multi-disk storage system equipped with a segmented disk cache processing a workload of multiple relational scans. Prefetching is a popular method of improving the performance of scans. Many modern disks have a multisegment cache which can be used for prefetching. We observe that, exploiting declustering as a data placement method, prefetching in a segmented cache causes a load imbalance among several disks. A single disk becomes a bottleneck, degrading performance of the entire system. A variation in disk queue length is a primary factor of the imbalance. Using a precise simulation model, we investigate several approaches to achieving better balancing. Our metrics are a scan response time for the closed-end system and an ability to sustain a workload without saturating for the open-end system. We arrive at two main conclusions: (1) Prefetching in main memory is inexpensive and effective for balancing and can supplement or substitute prefetching in disk cache. (2) Disk-level prefetching provides about the same performance as main memory prefetching if request queues are managed in the disk controllers rather than in the host. Checking the disk cache before queuing requests provides not only better request response time but also drastically improves balancing. A single cache performs better than a segmented cache for this method.}, comment = {An interesting paper about disk-controller cache management in database workloads. Actually, the workloads are sequential scans of partitioned files, which could occur in many kinds of workloads. The declustering pattern (partitioning) is a little unusual for most scientific parallel I/O veterans, who are used to striping. And the cache-management algorithms seem a bit strange, particularly the fact that the cache appears to be used only for explicit prefetch requests. Turns out that it is best to put the prefetching and disk queueing in the same place, either on the controller or in main memory, to avoid load imbalance that arises from randomness in the workload, which is accentuated into a big bottleneck and a convoy effect.} } @InProceedings{soltis:gfs, author = {Steven R. Soltis and Thomas M. Ruwart and Matthew T. O'Keefe}, title = {The {Global File System}}, booktitle = {Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1996}, month = {September}, pages = {319--342}, publisher = {IEEE Computer Society Press}, address = {College Park, MD}, later = {soltis:bgfs}, URL = {http://gfs.lcse.umn.edu/pubs/nasa_talk96.ps}, keywords = {distributed file system, data storage, mass storage, network-attached disks, disk striping, parallel I/O, pario-bib}, comment = {see also preslan:gfs} } @InProceedings{solworth:mirror, author = {John A. Solworth and Cyril U. 
Orji}, title = {Distorted Mirrors}, booktitle = {Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year = {1991}, month = {December}, pages = {10--17}, later = {solworth:mirror2}, keywords = {disk mirroring, parallel I/O, pario-bib}, comment = {Write one disk (the master) in the usual way, and write the slave disk at the closest free block. Actually, they propose to logically partition the two disks so that each disk has a master partition and a slave partition. Up to 80\% improvement in small-write performance, while retaining good sequential read performance.} } @Article{solworth:mirror2, author = {John A. Solworth and Cyril U. Orji}, title = {Distorted Mirrors}, journal = {Journal of Distributed and Parallel Databases}, year = {1993}, month = {January}, volume = {1}, number = {1}, pages = {81--102}, earlier = {solworth:mirror}, keywords = {disk mirroring, parallel I/O, pario-bib}, comment = {See solworth:mirror.} } @InProceedings{spencer:pipeline, author = {M. Spencer and R. Ferreira and M. Beynon and T. Kurc and U. Catalyurek and A. Sussman and J. Saltz}, title = {Executing multiple pipelined data analysis operations in the Grid}, booktitle = {Proceedings of SC2002: High Performance Networking and Computing}, year = {2002}, month = {November}, address = {Baltimore, Maryland}, URL = {http://citeseer.nj.nec.com/spencer02executing.html}, keywords = {DataCutter, pipeline, dataflow, pario-bib}, abstract = {Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.} } @InProceedings{srinilta:strategies, author = {Chutimet Srinilta and Divyesh Jadav and Alok Choudhary}, title = {Design and Evaluation of Data Storage and Retrieval Strategies in a Distributed Memory Continuous Media Server}, booktitle = {Proceedings of the Eleventh International Parallel Processing Symposium}, year = {1997}, month = {April}, pages = {360--367}, URL = {http://www.ece.nwu.edu/~csrinilt/mm/pub/ipps97.ps}, keywords = {threads, parallel I/O, pario-bib}, abstract = {High performance servers and high-speed networks will form the backbone of the infrastructure required for distributed multimedia information systems. Given that the goal of such a server is to support hundreds of interactive data streams simultaneously, various tradeoffs are possible with respect to the storage of data on secondary memory, and its retrieval therefrom. In this paper we identify and evaluate these tradeoffs. We evaluate the effect of varying the stripe factor and also the performance of batched retrieval of disk-resident data. We develop a methodology to predict the stream capacity of such a server. The evaluation is done for both uniform and skewed access patterns. Experimental results on the Intel Paragon computer are presented.} } @MastersThesis{stabile:disks, author = {James Joseph Stabile}, title = {Disk Scheduling Algorithms for a Multiple Disk System}, year = {1988}, school = {UC Davis}, keywords = {parallel I/O, parallel file system, disk mirroring, disk scheduling, pario-bib}, comment = {Describes simulation based on model of disk access pattern. 
Multiple-disk system, much like in matloff:multidisk. Files stored in two copies, each on a separate disk, but there are more than two disks, so this differs from mirroring. He compares several disk scheduling algorithms. A variant of SCAN seems to be the best.} } @Article{steenkiste:net, author = {Peter Steenkiste}, title = {A High-Speed Network Interface for Distributed-Memory Systems: Architecture and Applications}, journal = {ACM Transactions on Computer Systems}, year = {1997}, month = {February}, volume = {15}, number = {1}, pages = {75--109}, publisher = {ACM Press}, keywords = {parallel computer architecture, interconnection network, network interface, distributed memory, systolic array, input/output, parallel I/O, pario-bib}, comment = {See also steenkiste:interface, kung:network, hemy:gigabit, bornstein:reshuffle, and gross:io.} } @MastersThesis{stockinger:dictionary, author = {Heinz Stockinger}, title = {Dictionary on Parallel Input/Output}, year = {1998}, month = {February}, school = {Department of Data Engineering, University of Vienna}, URL = {http://www.cs.dartmouth.edu/pario/dictionary.ps.Z}, keywords = {dictionary, survey, parallel I/O, pario-bib}, comment = {A tremendous resource.} } @InCollection{stodolsky:blogging, author = {Daniel Stodolsky and Garth Gibson and Mark Holland}, title = {Parity Logging Overcoming the Small Write Problem in Redundant Disk Arrays}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {5}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {67--80}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {stodolsky:logging}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, RAID, redundancy, reliability, disk array, pario-bib}, comment = {Part of jin:io-book; reformatted version of stodolsky:logging.} } @Article{stodolsky:jlogging, author = {Daniel Stodolsky and Mark Holland and William V. {Courtright II} and Garth A. Gibson}, title = {Parity-Logging Disk Arrays}, journal = {ACM Transactions on Computer Systems}, year = {1994}, month = {August}, volume = {12}, number = {3}, pages = {206--235}, publisher = {ACM Press}, earlier = {stodolsky:logging}, keywords = {parallel I/O, RAID, redundancy, reliability, pario-bib}, comment = {See stodolsky:logging. An in-between version is CMU-CS-94-170, stodolsky:logging-tr.} } @InProceedings{stodolsky:logging, author = {Daniel Stodolsky and Garth Gibson and Mark Holland}, title = {Parity Logging: Overcoming the Small Write Problem in Redundant Disk Arrays}, booktitle = {Proceedings of the 20th Annual International Symposium on Computer Architecture}, year = {1993}, pages = {64--75}, earlier = {stodolsky:logging-tr}, later = {stodolsky:jlogging}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/ParityLogging/ISCA93.ps}, keywords = {parallel I/O, RAID, redundancy, reliability, disk array, pario-bib}, abstract = {Parity encoded redundant disk arrays provide highly reliable, cost effective secondary storage with high performance for read accesses and large write accesses. Their performance on small writes, however, is much worse than mirrored disks - the traditional, highly reliable, but expensive organization for secondary storage. Unfortunately, small writes are a substantial portion of the I/O workload of many important, demanding applications such as on-line transaction processing. 
This paper presents parity logging, a novel solution to the small write problem for redundant disk arrays. Parity logging applies journalling techniques to substantially reduce the cost of small writes. We provide a detailed analysis of parity logging and competing schemes - mirroring, floating storage, and RAID level 5 - and verify these models by simulation. Parity logging provides performance competitive with mirroring, the best of the alternative single failure tolerating disk array organizations. However, its overhead cost is close to the minimum offered by RAID level 5. Finally, parity logging can exploit data caching much more effectively than all three alternative approaches.}, comment = {Cite stodolsky:jlogging. Earlier version is CMU-CS-93-200. Parity logging to improve small writes. Log all parity updates; when it fills, go redo parity disk. Actually distribute the parity and log across all disks. Performance is comparable to, or exceeding, mirroring. Also handling double failures.} } @TechReport{stodolsky:logging-tr, author = {Daniel Stodolsky and Mark Holland and William V. {Courtright II} and Garth A. Gibson}, title = {A Redundant Disk Array Architecture for Efficient Small Writes}, year = {1994}, month = {July}, number = {CMU-CS-94-170}, institution = {Carnegie Mellon University}, note = {Revised from CMU-CS-93-200.}, later = {stodolsky:logging}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/ParityLogging/TR94-170.ps}, keywords = {parallel I/O, disk array, RAID, redundancy, reliability, pario-bib}, abstract = {Parity encoded redundant disk arrays provide highly reliable, cost effective secondary storage with high performance for reads and large writes. Their performance on small writes, however, is much worse than mirrored disks - the traditional, highly reliable, but expensive organization for secondary storage. Unfortunately, small writes are a substantial portion of the I/O workload of many important, demanding applications such as on-line transaction processing. This paper presents parity logging, a novel solution to the small write problem for redundant disk arrays. Parity logging applies journalling techniques to substantially reduce the cost of small writes. We provide detailed models of parity logging and competing schemes - mirroring, floating storage, and RAID level 5 - and verify these models by simulation. Parity logging provides performance competitive with mirroring, but with capacity overhead close to the minimum offered by RAID level 5. Finally, parity logging can exploit data caching more effectively than all three alternative approaches.} } @Article{stone:query, author = {Harold S. Stone}, title = {Parallel Querying of Large Databases: {A} Case Study}, journal = {IEEE Computer}, year = {1987}, month = {October}, volume = {20}, number = {10}, pages = {11--21}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, database, SIMD, connection machine, pario-bib}, comment = {See also IEEE Computer, Jan 1988, p. 8 and 10. Examines a database query that is parallelized for the Connection Machine. He shows that in many cases, a smarter serial algorithm that reads only a portion of the database (through an index) will be faster than 64K processors reading the whole database. Uses a simple model for the machines to show this. 
Reemphasizes the point of Boral and DeWitt that I/O is the bottleneck of a database machine, and that parallelizing the processing will not necessarily help a great deal.} } @InCollection{stonebraker:bradd, author = {Michael Stonebraker and Gerhard A. Schloss}, title = {Distributed {RAID}--- A New Multiple Copy Algorithm}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {6}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {81--89}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {stonebraker:radd}, URL = {http://www.buyya.com/superstorage/}, keywords = {disk striping, reliability, pario-bib}, comment = {Part of jin:io-book; reformatted version of stonebraker:radd.} } @InProceedings{stonebraker:radd, author = {Michael Stonebraker and Gerhard A. Schloss}, title = {Distributed {RAID} --- {A} New Multiple Copy Algorithm}, booktitle = {Proceedings of 6th International Data Engineering Conference}, year = {1990}, pages = {430--437}, later = {stonebraker:bradd}, keywords = {disk striping, reliability, pario-bib}, comment = {This is about ``RADD'', a distributed form of RAID. Meant for cases where the disks are physically distributed around several sites, and no one controller controls them all. Much lower space overhead than any mirroring technique, with comparable normal-mode performance at the expense of failure-mode performance.} } @TechReport{stonebraker:xprs, author = {Michael Stonebraker and Randy Katz and David Patterson and John Ousterhout}, title = {The Design of {XPRS}}, year = {1988}, month = {March}, number = {UCB/ERL M88/19}, institution = {UC Berkeley}, keywords = {parallel I/O, disk array, RAID, Sprite, disk architecture, database, pario-bib}, comment = {Designing a DBMS for Sprite and RAID. High availability, high performance. Shared memory multiprocessor. Allocates extents to files that are interleaved over a variable number of disks, and over a contiguous set of tracks on those disks.} } @MastersThesis{subramaniam:msthesis, author = {Mahesh Subramaniam}, title = {Efficient Implementation of Server-Directed I/O}, year = {1996}, month = {June}, school = {Dept. of Computer Science, University of Illinois}, URL = {http://bunny.cs.uiuc.edu/CDR/pubs/mahesh-thesis.html}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {Parallel computers are a cost effective approach to providing significant computational resources to a broad range of scientific and engineering applications. Due to the relatively lower performance of the I/O subsystems on these machines and due to the significant I/O requirements of these applications, the I/O performance can become a major bottleneck. Optimizing the I/O phase of these applications poses a significant challenge. A large number of these scientific and engineering applications perform simple operations on multidimensional arrays and providing an easy and efficient mechanism for implementing these operations is important. The Panda array I/O library provides simple high level interfaces to specify collective I/O operations on multidimensional arrays in a distributed memory single-program multiple-data (SPMD) environment. The high level information provided by the user through these interfaces allows the Panda array I/O library to produce an efficient implementation of the collective I/O request. The use of these high level interfaces also increases the portability of the application. 
\par This thesis presents an efficient and portable implementation of the Panda array I/O library. In this implementation, standard software components are used to build the I/O library to aid its portability. The implementation also provides a simple, flexible framework for the implementation and integration of the various collective I/O strategies. The server directed I/O and the reduced messages server directed I/O algorithms are implemented in the Panda array I/O library. This implementation supports the sharing of the I/O servers between multiple applications by extending the collective I/O strategies. Also, the implementation supports the use of part time I/O nodes where certain designated compute nodes act as the I/O servers during the I/O phase of the application. The performance of this implementation of the Panda array I/O library is measured on the IBM SP2 and the performance results show that for read and write operations, the collective I/O strategies used by the Panda array I/O library achieve throughputs close to the maximum throughputs provided by the underlying file system on each I/O node of the IBM SP2.} } @Article{sun:dynamic, author = {Weitao T. Sun and Jiwu W. Shu and Weimin M. Zheng}, title = {Dynamic file allocation in Storage Area Networks with neural network prediction}, journal = {Lecture Notes in Computer Science}, booktitle = {International Symposium on Neural Networks (ISNN 2004); August 19-21, 2004; Dalian, PEOPLES R CHINA}, editor = {Yin, FL; Wang, J; Guo, CG}, year = {2004}, month = {June}, volume = {3174}, pages = {719--724}, institution = {Tsing Hua Univ, Dept Comp Sci \& Technol, Beijing 100084, Peoples R China}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://www.springerlink.com/link.asp?id=7t97qycr7awnbw6j}, keywords = {SAN, dynamic data reorganization, neural network, access pattern prediction, pario-bib}, abstract = {Disk arrays are widely used in Storage Area Networks (SANs) to achieve mass storage capacity and high level I/O parallelism. Data partitioning and distribution among the disks is a promising approach to minimize the file access time and balance the I/O workload. But disk I/O parallelism by itself does not guarantee the optimal performance of an application. The disk access rates fluctuate with time because of access pattern variations, which leads to a workload imbalance. The user access pattern prediction is of great importance to dynamic data reorganization between hot and cool disks. Data migration occurs according to current and future disk allocation states and access frequencies. The objective of this paper is to develop a neural network based disk allocation trend prediction method and optimize the disks' file capacity to their balanced level. A Levenberg-Marquardt neural network was adopted to predict the disk access frequencies with the I/O track history. Data reorganization on disk arrays was optimized to provide a good workload balance. The simulation results proved that the proposed method performs well.} } @Unpublished{taber:metadisk, author = {David Taber}, title = {{MetaDisk} Driver Technical Description}, year = {1990}, month = {October}, note = {SunFlash electronic mailing list 22(9)}, keywords = {disk mirroring, parallel I/O, pario-bib}, comment = {MetaDisk is an addition to the Sun SPARCstation server kernel. It allows disk mirroring between any two local disk partitions, or concatenation of several disk partitions into one larger partition. 
Can span up to 4 partitions simultaneously. Appears not to be striped, just allows bigger partitions, and (by chance) some parallel I/O for large files.} } @Article{takahashi:performance, author = {Naoua Takahashi and Yasuo Kurosu}, title = {Performance improvement of disk array subsystems having shared cache and control memories}, journal = {Transactions of the Institute of Electronics, Information and Communication Engineers D-I}, year = {2003}, month = {June}, volume = {J86-D-I}, number = {6}, pages = {375--388}, publisher = {Japan : Inst. Electron. Inf. \& Commun. Eng, 2003}, copyright = {(c)2004 IEE}, URL = {http://www3.interscience.wiley.com/cgi-bin/abstract/108561017/ABSTRACT}, keywords = {disk array, star network topology, shared cache, pario-bib}, abstract = {Disk array subsystems have serious demands for higher speed and greater number of channels along with the trends in improving operational efficiency of information system by integrating its storage subsystems. Conventional disk array subsystem employs a bus-structured connection between its microprocessors and shared cache and control memories. In general, a network-structured connection can be faster as compared with a bus-structured one although a switch causes higher latency. In this paper we propose a hybrid star-net connection consisting of a hierarchically switched star fan-out for cache memory and a direct star fan-out for control memory, where cache is used as a temporary store of host data, and control memory stores various control data including cache control tables. The latter requires more speed than the former. Based on the proposed connection, we developed a disk array subsystem with host interface having 32 channels, and evaluated its performance. We could attain sequential performance of 920MB/s and transaction performance of 160KIO/s. In comparison to the conventional disk array subsystem, the former is 5 times, and the latter is 2.5 times better. (12 refs.)} } @Article{talia:data-intensive, author = {Domenico Talia and Pradip K. Srimani}, title = {Parallel data-intensive algorithms and applications}, journal = {Parallel Computing}, year = {2002}, month = {May}, volume = {28}, number = {5}, pages = {669--671}, publisher = {Elsevier Science}, URL = {http://www.elsevier.com/gej-ng/10/35/21/60/57/27/abstract.html}, keywords = {parallel application, parallel I/O, pario-bib}, comment = {guest editorial, no abstract} } @TechReport{tan:pizzas, author = {Michael Tan and Nick Roussopoulos and Steve Kelley}, title = {The {Tower of Pizzas}}, year = {1995}, month = {April}, number = {UMIACS-TR-95-52}, institution = {University of Maryland Institute for Advanced Computer Studies (UMIACS)}, URL = {ftp://ftp.cs.umd.edu/pub/papers/papers/3462/3462.ps.Z}, keywords = {parallel I/O, pario-bib}, abstract = {CPU speeds are increasing at a much faster rate than secondary storage device speeds. Many important applications face an I/O bottleneck. We demonstrate that this bottleneck can be alleviated through 1) scalable striping of data and 2) caching/prefetching techniques. This paper describes the design and performance of the Tower of Pizzas (TOPs), a portable software system providing parallel I/O and buffering services.}, comment = {Same as CS-TR-3462 from Department of Computer Science. Basically, a parallel file system for a workstation cluster using the usual parallel file-system ideas. They do support client-side caching, using a client-side server process which shares memory with the client. 
Otherwise nothing really new.} } @Article{taylor:magic, author = {Herb Taylor and Danny Chin and Stan Knight}, title = {The {Magic} Video-on-Demand Server and Real-Time Simulation System}, journal = {IEEE Parallel and Distributed Technology}, year = {1995}, month = {Summer}, volume = {3}, number = {2}, pages = {40--51}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, multimedia, video on demand, pario-bib}, comment = {They describe a video server system being developed at the Sarnoff Real Time Corporation. This paper describes their simulated system. It is intended as more than a video-on-demand system, but also for capture and processing as well as playback. So they have a complex system of interconnected SIMD boards, each with a high-speed link to various devices, including a collection of disk drives. Data is striped across disks. They integrate playback scheduling and the disk striping in an interesting way.} } @TechReport{tennenhouse:debug, author = {Marsha Tennenhouse and Dror Zernik}, title = {Visual Debugging of Parallel File System Programs}, year = {1995}, month = {March}, institution = {IBM}, URL = {http://www.almaden.ibm.com/watson/pv/vestaabs.html}, keywords = {debugging, visualization, parallel file system, parallel I/O, pario-bib} } @InCollection{tewari:bhigh, author = {Renu Tewari and Daniel M. Dias and Rajat Mukherjee and Harrick M. Vin}, title = {High Availability in Clustered Multimedia Servers}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {38}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {555--565}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {tewari:high}, URL = {http://www.buyya.com/superstorage/}, keywords = {cluster, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of tewari:high.} } @InProceedings{tewari:high, author = {Renu Tewari and Daniel M. Dias and Rajat Mukherjee and Harrick M. Vin}, title = {High Availability in Clustered Multimedia Servers}, booktitle = {Proceedings of the Twelfth International Conference on Data Engineering}, year = {1996}, pages = {645--654}, later = {tewari:bhigh}, URL = {http://ieeexplore.ieee.org:80/xpl/tocresult.jsp?isNumber=10642&page=5}, keywords = {cluster, parallel I/O, pario-bib}, abstract = {Clustered multimedia servers, consisting of interconnected nodes and disks, have been proposed for large-scale servers that are capable of supporting multiple concurrent streams which access the video objects stored in the server. As the number of disks and nodes in the cluster increases, so does the probability of a failure. With data striped across all disks in a cluster, the failure of a single disk or node results in the disruption of many or all streams in the system. Guaranteeing high availability in such a cluster becomes a primary requirement to ensure continuous service. In this paper, we study mirroring and software RAID schemes with different placement strategies that guarantee high availability in the event of disk and node failures while satisfying the real-time requirements of the streams. We examine various declustering techniques for spreading the redundant information across disks and nodes and show that random declustering has good real-time performance. Finally, we compare the overall cost per stream for different system configurations. 
We derive the parameter space where mirroring and software RAID apply, and determine optimal parity group sizes.} } @InProceedings{thakur:abstract, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {An Abstract-Device Interface for Implementing Portable Parallel-{I/O} Interfaces}, booktitle = {Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation}, year = {1996}, month = {October}, pages = {180--187}, earlier = {thakur:abstract-tr}, URL = {http://www.mcs.anl.gov/~thakur/papers/adio.ps}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {In this paper, we propose a strategy for implementing parallel-I/O interfaces portably and efficiently. We have defined an abstract-device interface for parallel I/O, called ADIO. Any parallel-I/O API can be implemented on multiple file systems by implementing the API portably on top of ADIO, and implementing only ADIO on different file systems. This approach simplifies the task of implementing an API and yet exploits the specific high-performance features of individual file systems. We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our performance studies indicate that the overhead of using ADIO as an implementation strategy is very low.} } @TechReport{thakur:abstract-tr, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {An Abstract-Device Interface for Implementing Portable Parallel-{I/O} Interfaces}, year = {1996}, month = {May}, number = {MCS-P592-0596}, institution = {Argonne National Laboratory, Mathematics and Computer Science Division}, later = {thakur:abstract}, URL = {http://www.mcs.anl.gov/~thakur/papers/adio.ps}, keywords = {multiprocessor file system interface, parallel I/O, pario-bib}, comment = {They propose an intermediate interface that can serve as an implementation base for all parallel file-system APIs, and which can itself be implemented on top of all parallel file systems. This ``universal'' interface allows all applications to run on all file systems without porting, and lets people experiment with different APIs.} } @Article{thakur:applications, author = {Rajeev Thakur and Ewing Lusk and William Gropp}, title = {{I/O} in Parallel Applications: The Weakest Link}, journal = {The International Journal of High Performance Computing Applications}, year = {1998}, month = {Winter}, volume = {12}, number = {4}, pages = {389--395}, note = {In a Special Issue on I/O in Parallel Applications}, URL = {http://www.mcs.anl.gov/~thakur/papers/ijsa-article.ps}, keywords = {parallel I/O application, pario-bib}, abstract = {Parallel computers are increasingly being used to run large-scale applications that also have huge I/O requirements. However, many applications obtain poor I/O performance on modern parallel machines. This special issue of IJSA contains papers that describe the I/O requirements and the techniques used to perform I/O in real parallel applications. We first explain how the I/O application program interface (API) plays a critical role in enabling such applications to achieve high I/O performance. We describe how the commonly used Unix I/O interface is inappropriate for parallel I/O and how an explicitly parallel API with support for collective I/O can help the underlying I/O hardware and software perform I/O efficiently. We then describe MPI-IO, a recently defined, standard, portable API specifically designed for high-performance parallel I/O. 
We conclude with an overview of the papers in this special issue.} } @TechReport{thakur:astrophysics, author = {Rajeev Thakur and Ewing Lusk and William Gropp}, title = {{I/O} Characterization of a Portable Astrophysics Application on the {IBM SP} and {Intel Paragon}}, year = {1995}, month = {August}, number = {MCS-P534-0895}, institution = {Argonne National Laboratory}, note = {Revised October 1995}, URL = {http://www.mcs.anl.gov/~thakur/papers/astro.ps}, keywords = {file access pattern, workload characterization, parallel I/O, pario-bib}, abstract = {Many large-scale applications on parallel machines are bottlenecked by the I/O performance rather than the CPU or communication performance of the system. To improve the I/O performance, it is first necessary for system designers to understand the I/O requirements of various applications. This paper presents the results of a study of the I/O characteristics and performance of a real, I/O-intensive, portable, parallel application in astrophysics, on two different parallel machines---the IBM SP and the Intel Paragon. We instrumented the source code to record all I/O activity, and analyzed the resulting trace files. Our results show that, for this application, the I/O consists of fairly large writes, and writing data to files is faster on the Paragon, whereas opening and closing files are faster on the SP. We also discuss how the I/O performance of this application could be improved; particularly, we believe that this application would benefit from using collective I/O.}, comment = {Adds another data point to the collection of parallel scientific applications whose I/O has been characterized, a collection started in earnest by crandall:iochar. It's a pretty straightforward application; it just writes its matrices every few timesteps. The application writes whole matrices; the OS sees request sizes that are more a factor of the Chameleon library than of the application. Most of the I/O itself is not implemented in parallel, because they used UniTree on the SP, and because the Chameleon library sequentializes this kind of I/O through one node. Other numbers from the paper don't add much insight into the workload. Revised slightly in October 1995; the abstract represents that revision.} } @InProceedings{thakur:evaluation, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {An Experimental Evaluation of the Parallel {I/O} Systems of the {IBM~SP} and {Intel Paragon} Using a Production Application}, booktitle = {Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC)}, year = {1996}, month = {September}, series = {Lecture Notes in Computer Science}, volume = {1127}, pages = {24--35}, publisher = {Springer-Verlag}, earlier = {thakur:evaluation-tr}, URL = {http://www.mcs.anl.gov/~thakur/papers/io-eval.ps}, keywords = {parallel I/O, multiprocessor file system, workload characterization, pario-bib}, abstract = {We present the results of an experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a real three-dimensional parallel application code. This application, developed by scientists at the University of Chicago, simulates the gravitational collapse of self-gravitating gaseous clouds. It performs parallel I/O by using library routines that we developed and optimized separately for the SP and Paragon. The I/O routines perform two-phase I/O and use the parallel file systems PIOFS on the SP and PFS on the Paragon. 
We studied the I/O performance for two different sizes of the application. In the small case, we found that I/O was much faster on the SP. In the large case, open, close, and read operations were only slightly faster, and seeks were significantly faster, on the SP; whereas, writes were slightly faster on the Paragon. The communication required within our I/O routines was faster on the Paragon in both cases. The highest read bandwidth obtained was 48\,Mbytes/sec., and the highest write bandwidth obtained was 31.6\,Mbytes/sec., both on the SP.} } @TechReport{thakur:evaluation-tr, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {An Experimental Evaluation of the Parallel {I/O} Systems of the {IBM~SP} and {Intel Paragon} Using a Production Application}, year = {1996}, month = {February}, number = {MCS-P569--0296}, institution = {Argonne National Laboratory}, later = {thakur:evaluation}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {This paper presents the results of an experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon. For the evaluation, we used a full, three-dimensional application code that is in production use for studying the nonlinear evolution of Jeans instability in self-gravitating gaseous clouds. The application performs I/O by using library routines that we developed and optimized separately for parallel I/O on the SP and Paragon. The I/O routines perform two-phase I/O and use the PIOFS file system on the SP and PFS on the Paragon. We studied the I/O performance for two different sizes of the application. We found that for the small case, I/O was faster on the SP, whereas for the large case, I/O took almost the same time on both systems. Communication required for I/O was faster on the Paragon in both cases. The highest read bandwidth obtained was 48 Mbytes/sec. and the highest write bandwidth obtained was 31.6 Mbytes/sec., both on the SP.}, comment = {This version no longer on the web.} } @TechReport{thakur:ext2phase, author = {Rajeev Thakur and Alok Choudhary}, title = {Accessing Sections of Out-of-Core Arrays Using an Extended Two-Phase Method}, year = {1995}, month = {January}, number = {SCCS-685}, institution = {NPAC}, address = {Syracuse University}, later = {thakur:ext2phase2}, keywords = {parallel I/O, pario-bib}, abstract = {In out-of-core computations, data needs to be moved back and forth between main memory and disks during program execution. In this paper, we propose a technique called the Extended Two-Phase Method, for accessing sections of out-of-core arrays efficiently. This is an extension and generalization of the Two-Phase Method for reading in-core arrays from files, which was previously proposed in [Rosario93,Bordawekar93]. The Extended Two-Phase Method uses collective I/O in which all processors cooperate to perform I/O in an efficient manner by combining several I/O requests into fewer larger requests, eliminating multiple disk accesses for the same data and reducing contention for disks. We describe the algorithms for reading as well as writing array sections. Performance results on the Intel Touchstone Delta for many different access patterns are presented and analyzed. 
It is observed that the Extended Two-Phase Method gives consistently good performance over a wide range of access patterns.}, comment = {Revised as thakur:ext2phase2 and thakur:jext2phase.} } @TechReport{thakur:ext2phase2, author = {Rajeev Thakur and Alok Choudhary}, title = {An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays}, year = {1995}, month = {June}, number = {CACR-103}, institution = {Scalable I/O Initiative}, address = {Center for Advanced Computing Research, Caltech}, note = {Revised November 1995}, earlier = {thakur:ext2phase}, later = {thakur:jext2phase}, URL = {http://www.mcs.anl.gov/~thakur/papers/cacr-103.ps}, keywords = {parallel I/O, pario-bib}, abstract = {A number of applications on parallel computers deal with very large data sets which cannot fit in main memory. In such cases, data must be stored in files on disks and fetched into main memory during program execution. In programs with large out-of-core arrays stored in files, it is necessary to read/write smaller sections of the arrays from/to files. This paper describes a method, called the {\em extended two-phase method}, for accessing sections of out-of-core arrays in an efficient manner. This method uses collective I/O in which processors cooperate to combine several I/O requests into fewer larger granularity requests, reorder requests so that the file is accessed in proper sequence, and eliminate simultaneous I/O requests for the same data. The I/O workload is divided among processors dynamically, depending on the access requests. We present performance results for two real, out-of-core, parallel applications --- matrix multiplication and a Laplace's equation solver --- and several synthetic access patterns. The results indicate that the extended two-phase method provides a significant performance improvement over a direct method for I/O.}, comment = {Revised version of thakur:ext2phase. The tech report was itself revised in November 1995; the abstract represents that revision.} } @Article{thakur:jext2phase, author = {Rajeev Thakur and Alok Choudhary}, title = {{An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays}}, journal = {Scientific Programming}, year = {1996}, month = {Winter}, volume = {5}, number = {4}, pages = {301--317}, earlier = {thakur:ext2phase2}, URL = {http://www.mcs.anl.gov/~thakur/papers/ext2ph.ps}, keywords = {parallel I/O, pario-bib}, abstract = {A number of applications on parallel computers deal with very large data sets that cannot fit in main memory. In such applications, data must be stored in files on disks and fetched into memory during program execution. Parallel programs with large out-of-core arrays stored in files must read/write smaller sections of the arrays from/to files. In this article, we describe a method for accessing sections of out-of-core arrays efficiently. Our method, the extended two-phase method, uses collective I/O: Processors cooperate to combine several I/O requests into fewer larger granularity requests, reorder requests so that the file is accessed in proper sequence, and eliminate simultaneous I/O requests for the same data. In addition, the I/O workload is divided among processors dynamically, depending on the access requests. We present performance results obtained from two real out-of-core parallel applications---matrix multiplication and a Laplace's equation solver---and several synthetic access patterns, all on the Intel Touchstone Delta. 
These results indicate that the extended two-phase method significantly outperformed a direct (noncollective) method for accessing out-of-core array sections.} } @Article{thakur:jpassion, author = {Rajeev Thakur and Alok Choudhary and Rajesh Bordawekar and Sachin More and Sivaramakrishna Kuditipudi}, title = {Passion: Optimized {I/O} for Parallel Applications}, journal = {IEEE Computer}, year = {1996}, month = {June}, volume = {29}, number = {6}, pages = {70--78}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/computer/co1996/r6070abs.htm}, keywords = {parallel I/O, pario-bib}, abstract = {Parallel computers with peak performance of more than 100 Gflops/second are already available to solve a variety of problems in a range of disciplines. However, the input/output performance of these machines is a poor reflection of their true computational power. \par To improve the I/O performance of parallel programs with distributed multidimensional arrays, we have developed a software library called Passion (Parallel, Scalable Software for Input/Output). Passion's routines are designed to read or write either entire distributed arrays or sections of such arrays. Passion also frees the programmer from many of the tedious tasks associated with performing I/O in parallel programs and has a high-level interface that makes it easy to specify the required I/O. \par We have implemented Passion on Intel's Paragon, Touchstone Delta, and iPSC/860 systems, and on the IBM SP system. We have also made it publicly available through the World Wide Web (http://www.cat.syr.edu/passion.html). We are in the process of porting the library to other machines and extending its functionality.}, comment = {See thakur:passion, choudhary:passion.} } @InProceedings{thakur:mpi, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {A Case for Using {MPI's} Derived Datatypes to Improve {I/O} Performance}, booktitle = {Proceedings of SC98: High Performance Networking and Computing}, year = {1998}, month = {November}, publisher = {ACM Press}, earlier = {thakur:mpi-tr}, URL = {http://www.mcs.anl.gov/~thakur/dtype/}, keywords = {MPI, parallel I/O, pario-bib}, abstract = {MPI-IO, the I/O part of the MPI-2 standard, is a promising new interface for parallel I/O. A key feature of MPI-IO is that it allows users to access several noncontiguous pieces of data from a file with a single I/O function call by defining file views with derived datatypes. We explain how critical this feature is for high performance, why users must create and use derived datatypes whenever possible, and how it enables implementations to perform optimizations. In particular, we describe two optimizations our MPI-IO implementation, ROMIO, performs: data sieving and collective I/O. 
We demonstrate the performance and portability of the approach with performance results on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.} } @InProceedings{thakur:mpi-io-implement, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {On Implementing {MPI-IO} Portably and with High Performance}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {23--32}, earlier = {thakur:mpi-io-implement-tr}, URL = {http://www.mcs.anl.gov/~thakur/papers/mpio-impl.ps}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions ({\tt open}, {\tt lseek}, {\tt read}, {\tt write}, and {\tt close}), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO. \par In addition to basic I/O functionality, we consider the issues of supporting other MPI-IO features, such as 64-bit file sizes, noncontiguous accesses, collective I/O, asynchronous I/O, consistency and atomicity semantics, user-supplied hints, shared file pointers, portable data representation, and file preallocation. We describe how we implemented each of these features on various machines and file systems. The machines we consider are the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI Origin2000, and networks of workstations; and the file systems we consider are HP HFS, IBM PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix file system (UFS). \par We also present our thoughts on how a file system can be designed to better support MPI-IO. We provide a list of features desired from a file system that would help in implementing MPI-IO correctly and with high performance.} } @TechReport{thakur:mpi-io-implement-tr, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {On Implementing {MPI-IO} Portably and with High Performance}, year = {1998}, month = {October}, number = {ANL/MCS-P732-1098}, institution = {Mathematics and Computer Science Division, Argonne National Laboratory}, later = {thakur:mpi-io-implement}, URL = {http://www.mcs.anl.gov/~thakur/papers/mpio-impl.ps}, keywords = {parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. 
We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO.\par In addition to basic I/O functionality, we consider the issues of supporting other MPI-IO features, such as 64-bit file sizes, noncontiguous accesses, collective I/O, asynchronous I/O, consistency and atomicity semantics, user-supplied hints, shared file pointers, portable data representation, file preallocation, and some miscellaneous features. We describe how we implemented each of these features on various machines and file systems. The machines we consider are the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI Origin2000, and networks of workstations; and the file systems we consider are HP HFS, IBM PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix file system (UFS). \par We also present our thoughts on how a file system can be designed to better support MPI-IO. We provide a list of features desired from a file system that would help in implementing MPI-IO correctly and with high performance.} } @TechReport{thakur:mpi-tr, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {A Case for Using {MPI's} Derived Datatypes to Improve {I/O} Performance}, year = {1998}, month = {May}, number = {ANL/MCS-P717-0598}, institution = {Mathematics and Computer Science Division, Argonne National Laboratory}, later = {thakur:mpi}, URL = {http://www.mcs.anl.gov/~thakur/dtype/}, keywords = {MPI, parallel I/O, pario-bib}, abstract = {MPI-IO, the I/O part of the MPI-2 standard, is a promising new interface for parallel I/O. A key feature of MPI-IO is that it allows users to access several noncontiguous pieces of data from a file with a single I/O function call by defining file views with derived datatypes. We explain how critical this feature is for high performance, why users must create and use derived datatypes whenever possible, and how it enables implementations to perform optimizations. In particular, we describe two optimizations our MPI-IO implementation, ROMIO, performs: data sieving and collective I/O. We present performance results on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.} } @Article{thakur:noncontigous, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {Optimizing Noncontiguous Accesses in {MPI-IO}}, journal = {Parallel Computing}, year = {2002}, month = {January}, volume = {28}, number = {1}, pages = {83--105}, URL = {http://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.ps}, keywords = {parallel I/O, parallel I/O, MPI-IO, collective I/O, data sieving, pario-bib}, abstract = {The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. We first provide a classification of the different ways of expressing an application's I/O needs in MPI-IO---we classify them into four {\em levels}, called level~0 through level~3. 
We demonstrate that, for applications with noncontiguous access patterns, the I/O performance improves dramatically if users write their applications to make level-3 requests (noncontiguous, collective) rather than level-0 requests (Unix style). We then describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how we have implemented these optimizations portably on multiple machines and file systems, controlled their memory requirements, and also achieved high performance. We demonstrate the performance and portability with performance results for three applications---an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)---on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.} } @InProceedings{thakur:out-of-core, author = {Rajeev Thakur and Rajesh Bordawekar and Alok Choudhary}, title = {Compilation of Out-Of-Core Data Parallel Programs for Distributed Memory Machines}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, month = {April}, pages = {54--72}, organization = {Syracuse University}, note = {Also appeared in Computer Architecture News 22(4)}, later = {thakur:out-of-core-book}, keywords = {parallel I/O, pario-bib}, comment = {Earlier version available as NPAC/Syracuse tech report. They describe the design of an HPF compiler that can translate out-of-core programs into plain programs with explicit I/O. For the most part, they discuss many of the issues involved in manipulating the arrays, and some of the alternatives for run-time support. The out-of-core array is broken into pieces, one per processor. Each processor keeps its local array piece in a file on its own logical disk, and reads and writes pieces of that file as needed. Some of the tradeoffs appear to contrast the amount of I/O with the ability to optimize communication: they choose a method called ``out-of-core communication'' because it simplifies the analysis of communication patterns, although it requires more I/O. The compiler depends on run-time routines for support; the run-time routines hide a lot of the architectural details, simplifying the job of the compiler and making the resulting program more portable. There are some preliminary performance numbers.} } @InCollection{thakur:out-of-core-book, author = {Rajeev Thakur and Alok Choudhary}, title = {Runtime Support for Out-of-Core Parallel Programs}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {6}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {147--165}, publisher = {Kluwer Academic Publishers}, earlier = {thakur:out-of-core}, keywords = {parallel I/O, out-of-core, pario-bib}, abstract = {In parallel programs with large out-of-core arrays stored in files, it is necessary to read/write smaller sections of the arrays from/to files. We describe a runtime method for accessing sections of out-of-core arrays efficiently. This method, called the {\em extended two-phase method}, uses collective I/O in which processors cooperate to read/write out-of-core data in an efficient manner.
The I/O workload is divided among processors dynamically, depending on the access requests. Performance results on the Intel Touchstone Delta show that the extended two-phase method performs considerably better than a direct method for different access patterns, array sizes, and number of processors. We have used the extended two-phase method in the PASSION runtime library for parallel I/O.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @InProceedings{thakur:passion, author = {Rajeev Thakur and Rajesh Bordawekar and Alok Choudhary and Ravi Ponnusamy and Tarvinder Singh}, title = {{PASSION} Runtime Library for Parallel {I/O}}, booktitle = {Proceedings of the Scalable Parallel Libraries Conference}, year = {1994}, month = {October}, pages = {119--128}, later = {thakur:jpassion}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/splc94_passion_runtime.ps.Z}, keywords = {parallel I/O, pario-bib}, abstract = {We are developing a compiler and runtime support system called PASSION: Parallel And Scalable Software for Input-Output. PASSION provides software support for I/O intensive out-of-core loosely synchronous problems. This paper gives an overview of the PASSION Runtime Library and describes two of the optimizations incorporated in it, namely Data Prefetching and Data Sieving. Performance improvements provided by these optimizations on the Intel Touchstone Delta are discussed, together with an out-of-core Median Filtering application.}, comment = {See thakur:jpassion. They describe the PASSION library for parallel I/O, though the description is fairly high-level. The main things that this paper adds to earlier papers from this group are a discussion of Data Prefetching (which is really just an asynchronous I/O interface that their compiler uses for prefetching) and Data Sieving, which they use when the application needs to read some array section that is not contiguous in the file; for example, a submatrix of a 2-d matrix in a file stored row-major. Their solution is to read the complete set of rows (or columns, depending on file layout) in one huge read, into a memory buffer, and then extract the necessary data. Basically, this is another form of the two-phase strategy.} } @InProceedings{thakur:romio, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {Data Sieving and Collective {I/O} in {ROMIO}}, booktitle = {Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computation}, year = {1999}, month = {February}, pages = {182--189}, publisher = {IEEE Computer Society Press}, earlier = {thakur:romio-tr}, URL = {http://www.mcs.anl.gov/~thakur/papers/romio-coll.ps}, keywords = {parallel I/O, collective I/O, application programmer interface, pario-bib}, abstract = {The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. \par We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes.
We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications---an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)---on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.}, comment = {They describe how ROMIO, their MPI-IO implementation, delivers high performance through the use of data sieving and collective I/O. The paper discusses several specific optimizations. They have results from five major parallel platforms. The paper confirms that the UNIX interface is terrible for many parallel access patterns, and that collective I/O is an important solution.} } @TechReport{thakur:romio-tr, author = {Rajeev Thakur and William Gropp and Ewing Lusk}, title = {Data Sieving and Collective {I/O} in {ROMIO}}, year = {1998}, month = {August}, number = {ANL/MCS-P723-0898}, institution = {Mathematics and Computer Science Division, Argonne National Laboratory}, later = {thakur:romio}, URL = {http://www.mcs.anl.gov/~thakur/papers/romio-coll.ps}, keywords = {parallel I/O, collective I/O, application programmer interface, pario-bib}, abstract = {The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications--- an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)--- on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.} } @TechReport{thakur:romio-users, author = {Rajeev Thakur and Ewing Lusk and William Gropp}, title = {Users Guide for {ROMIO}: A High-Performance, Portable {MPI-IO} Implementation}, year = {1997}, month = {October}, number = {ANL/MCS-TM-234}, institution = {Mathematics and Computer Science Division, Argonne National Laboratory}, URL = {ftp://ftp.mcs.anl.gov/pub/thakur/romio/users-guide.ps.gz}, keywords = {file system interface, parallel I/O, pario-bib}, abstract = {ROMIO is a high-performance, portable implementation of MPI-IO (the I/O chapter in MPI-2). This document describes how to install and use ROMIO version~1.0.0 on various machines.} } @InProceedings{thakur:runtime, author = {R. Thakur and R. Bordawekar and A. 
Choudhary}, title = {{Compiler and Runtime Support for Out-of-Core HPF Programs}}, booktitle = {Proceedings of the 8th ACM International Conference on Supercomputing}, year = {1994}, month = {July}, pages = {382--391}, publisher = {ACM Press}, address = {Manchester, UK}, URL = {ftp://erc.cat.syr.edu/ece/choudhary/PASSION/ics94-out-of-core-hpf.ps.Z}, keywords = {parallel I/O, pario-bib}, abstract = {This paper describes the design of a compiler which can translate out-of-core programs written in a data parallel language like HPF. Such a compiler is required for compiling large scale scientific applications, such as the Grand Challenge applications, which deal with enormous quantities of data. We propose a framework by which a compiler together with appropriate runtime support can translate an out-of-core HPF program to a message passing node program with explicit parallel I/O. We describe the basic model of the compiler and the various transformations made by the compiler. We also discuss the runtime routines used by the compiler for I/O and communication. In order to minimize I/O, the runtime support system can reuse data already fetched into memory. The working of the compiler is illustrated using two out-of-core applications, namely a Laplace equation solver and LU Decomposition, together with performance results on the Intel Touchstone Delta.}, comment = {They describe ways to make HPF handle out-of-core arrays. Basically, they add directives to say which arrays are out of core, and how much memory to devote to the in-core portion of the array. Then the compiler distributes the array across processors, as in HPF, to form local arrays. Each local array is broken into slabs, where each slab can fit in local memory. The local array is kept in a local array file, from which slabs are loaded and stored. Ghost nodes are also handled. They were careful to avoid double I/O when one slab is another slab's ghost node. They found it most convenient to do all the communication between iterations, then do all the computation for that iteration, where the iteration itself required a loop including both computation and I/O. This means that there may need to be I/O during the communication phase, to store ghost nodes coming in from other places. They do not mention use of asynchronous I/O for overlap. See also bordawekar:efficient.} } @PhdThesis{thakur:thesis, author = {Rajeev Thakur}, title = {{Runtime Support for In-Core and Out-of-Core Data-Parallel Programs}}, year = {1995}, month = {May}, school = {Department of Electrical and Computer Engineering, Syracuse University}, URL = {http://www.mcs.anl.gov/~thakur/papers/phd_thesis.ps}, keywords = {parallel I/O, runtime library, pario-bib}, abstract = {Distributed memory parallel computers or distributed computer systems are widely recognized as the only cost-effective means of achieving teraflops performance in the near future. However, the fact remains that they are difficult to program and advances in software for these machines have not kept pace with advances in hardware. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be directly used in application programs for greater efficiency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory machines. 
\par In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents efficient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. The performance of these algorithms on the CM-5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparing with experimental results. \par A number of applications deal with very large data sets which cannot fit in main memory, and hence have to be stored in files on disks, resulting in out-of-core programs. This thesis also describes the design and implementation of efficient runtime support for out-of-core computations. Several optimizations for accessing out-of-core data are presented. An Extended Two-Phase Method is proposed for accessing sections of out-of-core arrays efficiently. This method uses collective I/O and the I/O workload is divided among processors dynamically, depending on the access requests. Performance results obtained using this runtime support for out-of-core programs on the Touchstone Delta are presented.} } @TechReport{think:cm-2, key = {TMC}, title = {{Connection Machine} Model {CM-2} Technical Summary}, year = {1987}, month = {April}, number = {HA87-4}, institution = {Thinking Machines}, keywords = {parallel I/O, connection machine, disk array, disk architecture, SIMD, pario-bib}, comment = {I/O and Data Vault, pp. 27--30} } @Book{think:cm5, key = {TMC}, title = {The {Connection Machine} {CM-5} Technical Summary}, year = {1991}, month = {October}, publisher = {Thinking Machines Corporation}, keywords = {computer architecture, connection machine, MIMD, SIMD, parallel I/O, pario-bib}, comment = {Some detail but still skips over some key aspects (like communication topology. Neat communications support makes for user-mode message-passing, broadcasting, reductions, all built in. Lots of info here. File system calls allows data to be transferred in parallel directly from I/O node to processing node, bypassing the partition and I/O management nodes. Multiple I/O devices (even DataVaults) can be logically striped. See also best:cmmdio, loverso:sfs, think:cmmd, think:sda.} } @Misc{think:cm5io, key = {TMC}, title = {The {CM-5} {I/O} system}, year = {1993}, howpublished = {Thinking Machines Corporation glossy}, keywords = {parallel I/O, disk array, striping, RAID, HIPPI, pario-bib}, comment = {More detail about I/O nodes than think:sda, including info about disk storage nodes, HIPPI nodes, and tape nodes (ITS).} } @Manual{think:cmmd, key = {TMC}, title = {{CMMD} User's Guide}, year = {1992}, month = {January}, organization = {Thinking Machines Corporation}, keywords = {MIMD, parallel programming, parallel I/O, message-passing, pario-bib} } @Misc{think:sda, key = {TMC}, title = {{CM-5} Scalable Disk Array}, year = {1992}, month = {November}, howpublished = {Thinking Machines Corporation glossy}, keywords = {parallel I/O, disk array, striping, RAID, pario-bib}, comment = {Disk storage nodes (processor, network interface, buffer, 4 SCSI controllers, 8 disks) attach individually to the CM-5 network. The software stripes across all nodes in the system. 
Thus, the collection of nodes is called a disk array. Multiple file systems across the array. Flexible redundancy. RAID~3 is used, i.e., bit-striped and a single parity disk. Remote access via NFS supported. Files stored in canonical order, with special hardware to help distribute data across processors. See best:cmmdio.} } @TechReport{thomas:panda, author = {Joel T. Thomas}, title = {The {Panda} Array {I/O} Library on the {Galley} Parallel File System}, year = {1996}, month = {June}, number = {PCS-TR96-288}, institution = {Dept. of Computer Science, Dartmouth College}, copyright = {Joel T. Thomas}, note = {Senior Honors Thesis.}, URL = {https://digitalcommons.dartmouth.edu/senior_theses/175/}, keywords = {multiprocessor file system, parallel I/O, pario-bib}, abstract = {The Panda Array I/O library, created at the University of Illinois, Urbana-Champaign, was built especially to address the needs of high-performance scientific applications. I/O has been one of the most frustrating bottlenecks to high performance for quite some time, and the Panda project is an attempt to ameliorate this problem while still providing the user with a simple, high-level interface. The Galley File System, with its hierarchical structure of files and strided requests, is another attempt at addressing the performance problem. My project was to redesign the Panda Array library for use on the Galley file system. This project involved porting Panda's three main functions: a checkpoint function for writing a large array periodically for 'safekeeping,' a restart function that would allow a checkpointed file to be read back in, and finally a timestep function that would allow the user to write a group of large arrays several times in a sequence. Panda supports several different distributions in both the compute-node memories and I/O-node disks. \par We have found that the Galley File System provides a good environment on which to build high-performance libraries, and that the mesh of Panda and Galley was a successful combination.}, comment = {See seamons:thesis.} } @Article{thomasian:allocation, author = {Alexander Thomasian}, title = {Data allocation and scheduling in disks and disk arrays}, journal = {Lecture Notes in Computer Science}, booktitle = {IEEE/CS Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems; October 12, 2003; Orlando, FL}, editor = {Calzarossa, MC; Gelenbe, E}, year = {2004}, month = {April}, volume = {2965}, pages = {357--384}, institution = {New Jersey Inst Technol, Dept Comp Sci, Newark, NJ 07102 USA}, publisher = {Springer-Verlag}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, URL = {http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=2965&spage=357}, keywords = {data allocation, scheduling, disk arrays, pario-bib}, abstract = {Magnetic disks, which together with disk arrays constitute a multibillion dollar industry, were developed in 1950s. Disks were an advance over magnetic drums, which had a dedicated read/write head per track, since much higher amounts of data could be accessed in a cost effective manner due to the sharability of the movable read/write heads. DRAM memories, which are volatile, were projected to replace disks a decade ago (see Section 2.4 in [33]). 
This did not materialize due to the inherent volatility of DRAM, i.e., a power source is required to ensure that DRAM contents are not lost, but also due to recent dramatic increases in areal recording density and hence disk capacity, which is estimated at 60\% compound annual growth rate - CAGR. This has resulted in a rapid decrease in cost per megabyte of disk capacity, so that it is lower than DRAM by a factor of 1000 to one.} } @InProceedings{tierney:cache, author = {Brian L. Tierney and Jason Lee and Brian Crowley and Mason Holding and Jeremy Hylton and Fred L. {Drake, Jr.}}, title = {A Network-Aware Distributed Storage Cache for Data-Intensive Environments}, booktitle = {Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing}, year = {1999}, month = {August}, pages = {185--193}, publisher = {IEEE Computer Society Press}, address = {Redondo Beach, CA}, URL = {http://computer.org/conferen/proceed/hpdc/0287/02870033abs.htm}, keywords = {distributed cache, distributed computing, grid, input/output, network-aware, parallel I/O, pario-bib}, abstract = {Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data at multiple sites around the world. The technologies, the middleware services, and the architectures that are used to build useful high-speed, wide area distributed systems, constitute the field of data intensive computing. In this paper we will describe an architecture for data intensive applications where we use a high-speed distributed data cache as a common element for all of the sources and sinks of data. This cache-based approach provides standard interfaces to a large, application-oriented, distributed, on-line, transient storage system. We describe our implementation of this cache, how we have made it "network aware," and how we do dynamic load balancing based on the current network conditions. We also show large increases in application throughput by access to knowledge of the network conditions.}, comment = {They discuss their implementation of a "network aware" data cache (Distributed Parallel Storage System) that adapts to changing network conditions. The system itself looks much like the Galley File System. The client library is multi-threaded with a client thread for each DPSS server. A DPSS server is composed of a block request thread, a block writer thread, a shared disk cache, and a reader thread for each disk. Block requests move into the shared cache from the disks. A DPSS master directs the clients' requests to an appropriate DPSS server. They use Java agents to monitor network performance and use data replication for load balancing. A minimum cost flow algorithm is run each time a client request arrives to determine the best place to retrieve the data block. They argue that since the algorithm is fast (< 1 ms), the overhead of the algorithm is not significant.} } @Manual{tmc:cmio, key = {TMC}, title = {Programming the {CM I/O} System}, year = {1990}, month = {November}, organization = {Thinking Machines Corporation}, keywords = {parallel I/O, file system interface, multiprocessor file system, pario-bib}, comment = {Have two types of files, parallel and serial, differing in the way data is laid out internally.
Also have three modes for reading the file: synchronous, streaming (asynchronous), and buffered.} } @InProceedings{tobis:foam, author = {Michael Tobis and Chad Schafer and Ian Foster and Robert Jacob and John Anderson}, title = {{FOAM}: Expanding the Horizons of Climate Modeling}, booktitle = {Proceedings of SC97: High Performance Networking and Computing}, year = {1997}, month = {November}, publisher = {IEEE Computer Society Press}, URL = {http://doi.acm.org/10.1145/509593.509620}, keywords = {parallel I/O, scientific application, pario-bib}, abstract = {We report here on a project that expands the applicability of dynamic climate modeling to very long time scales. The Fast Ocean-Atmosphere Model (FOAM) is a coupled ocean-atmosphere model that incorporates physics of interest in understanding decade to century time scale variability. It addresses the high computational cost of this endeavor with a combination of improved ocean model formulation, low atmosphere resolution, and efficient coupling. It also uses message-passing parallel processing techniques, allowing for the use of cost-effective distributed memory platforms. The resulting model runs over 6000 times faster than real time with good fidelity and has yielded significant results.}, comment = {This paper is about the Fast Ocean-Atmosphere Model (FOAM), a climate model that uses ``a combination of new model formulation and parallel computing to expand the time horizon that may be addressed by explicit fluid dynamical representations of the climate system.'' Their model uses message passing on massively parallel distributed-memory computer systems. They are in the process of investigating using parallel I/O to further increase their efficiency.} } @InProceedings{toledo:solar, author = {Sivan Toledo and Fred G. Gustavson}, title = {The Design and Implementation of {SOLAR}, a Portable Library for Scalable Out-of-Core Linear Algebra Computations}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {28--40}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, out-of-core, linear algebra, pario-bib}, abstract = {SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations on both shared-memory and distributed-memory machines, and its matrix input-output library supports both conventional I/O interfaces and parallel I/O interfaces. This paper discusses the overall design of SOLAR, its interfaces, and the design of several important subroutines. Experimental results show that SOLAR can factor on a single workstation an out-of-core positive-definite symmetric matrix at a rate exceeding 215 Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops. Less than 16\% of the running time is spent on I/O in these computations. These results indicate that SOLAR's portability does not compromise its performance. We expect that the combination of portability, modularity, and the use of a high-level I/O interface will make the library an important platform for research on out-of-core algorithms and on parallel I/O.}, comment = {Sounds great.
Library package that supports LAPACK-like functionality on in-core and out-of-core matrices. Good performance. Good portability (IBM workstation, IBM SP-2, and OS/2 laptop). They separate the matrix algorithms from the underlying I/O routines in an interesting way (read and write submatrices), leaving just enough information to allow the I/O system to do some higher-level optimizations.} } @InCollection{toledo:survey, author = {Sivan Toledo}, title = {A Survey of Out-of-Core Algorithms in Numerical Linear Algebra}, booktitle = {External Memory Algorithms and Visualization}, editor = {James Abello and Jeffrey Scott Vitter}, crossref = {abello:dimacs}, year = {1999}, series = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science}, pages = {161--180}, publisher = {American Mathematical Society Press}, address = {Providence, RI}, keywords = {out-of-core algorithm, survey, numerical analysis, linear algebra, pario-bib}, comment = {See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.} } @InProceedings{tomkins:multi-process, author = {Andrew Tomkins and R. Hugo Patterson and Garth Gibson}, title = {Informed Multi-Process Prefetching and Caching}, booktitle = {Proceedings of the 1997 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1997}, month = {June}, pages = {100--114}, publisher = {ACM Press}, URL = {http://www.acm.org/pubs/citations/proceedings/metrics/258612/p100-tomkins/}, keywords = {pario-bib} } @InProceedings{torrellas:PnetCDF, author = {Jianwei Li and Wei-keng Liao and Alok Choudhary and Robert Ross and Rajeev Thakur and William Gropp and Rob Latham and Andrew Siegel and Brad Gallagher and Michael Zingale}, title = {Parallel {netCDF}: A High-Performance Scientific {I/O} Interface}, booktitle = {Proceedings of SC2003: High Performance Networking and Computing}, year = {2003}, month = {November}, publisher = {IEEE Computer Society Press}, address = {Phoenix, AZ}, URL = {http://www.sc-conference.org/sc2003/paperpdfs/pap258.pdf}, keywords = {parallel I/O interface, netCDF, MPI-IO, pario-bib}, abstract = {Dataset storage, exchange, and access play a critical role in scientific applications. For such purposes netCDF serves as a portable, efficient file format and programming interface, which is popular in numerous scientific application domains. However, the original interface does not provide an efficient mechanism for parallel data storage and access. \par In this work, we present a new parallel interface for writing and reading netCDF datasets. This interface is derived with minimal changes from the serial netCDF interface but defines semantics for parallel access and is tailored for high performance. The underlying parallel I/O is achieved through MPI-IO, allowing for substantial performance gains through the use of collective I/O optimizations. We compare the implementation strategies and performance with HDF5. Our tests indicate programming convenience and significant I/O performance improvement with this parallel netCDF (PnetCDF) interface.}, comment = {published on the web only} } @InProceedings{towsley:cpuio, author = {Donald F. 
Towsley}, title = {The Effects of {CPU: I/O} Overlap in Computer System Configurations}, booktitle = {Proceedings of the 5th Annual International Symposium on Computer Architecture}, year = {1978}, month = {April}, pages = {238--241}, keywords = {parallel processing, parallel I/O, pario-bib}, comment = {Difficult to follow since it is missing its figures. ``Our most important result is that multiprocessor systems can benefit considerably more than single processor systems with the introduction of CPU: I/O overlap.'' They overlap I/O needed by some future CPU sequence with the current CPU operation. They claim it looks good for large numbers of processors. Their orientation seems to be for multiprocessors operating on independent tasks.} } @Article{towsley:cpuio-parallel, author = {D. Towsley and K. M. Chandy and J. C. Browne}, title = {Models for Parallel Processing within Programs: {Application} to {CPU: I/O} and {I/O: I/O} Overlap}, journal = {Communications of the ACM}, year = {1978}, month = {October}, volume = {21}, number = {10}, pages = {821--831}, keywords = {parallel processing, parallel I/O, pario-bib}, comment = {Models CPU:I/O and I/O:I/O overlap within a program. ``Overlapping is helpful only when it allows a device to be utilized which would not be utilized without overlapping.'' In general the overlapping seems to help.} } @InProceedings{trabado:io, author = {Guillermo P. Trabado and E. L. Zapata}, title = {Support for Massive Data Input/Output on Parallel Computers}, booktitle = {Proceedings of the Fifth Workshop on Compilers for Parallel Computers}, year = {1995}, month = {June}, pages = {347--356}, keywords = {parallel I/O, sparse matrix, pario-bib}, comment = {They discuss a library to support irregular data structures, really sparse matrices, on distributed-memory machines. Their library supports several in-memory and out-of-core data distributions, and routines to read and write matrices in those distributions. The paper is sketchy and poorly written. There is little material on I/O.} } @InProceedings{tran:adaptive, author = {Nancy Tran and Daniel A. Reed}, title = {{ARIMA} time series modeling and forecasting for adaptive {I/O} prefetching}, booktitle = {Proceedings of the 15th international conference on Supercomputing}, year = {2001}, month = {June}, pages = {473--485}, URL = {http://doi.acm.org/10.1145/377792.377905}, keywords = {pario-bib, access pattern, prefetching, modeling, time-series analysis}, abstract = {Bursty application I/O patterns, together with transfer limited storage devices, combine to create a major I/O bottleneck on parallel systems. This paper explores the use of time series models to forecast application I/O request times, then prefetching I/O requests during computation intervals to hide I/O latency. Experimental results with I/O intensive scientific codes show performance improvements compared to standard UNIX prefetching strategies.} } @Article{tran:jadaptive, author = {Nancy Tran and Daniel A. 
Reed}, title = {Automatic {ARIMA} Time Series Modeling for Adaptive {I/O} Prefetching}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2004}, month = {April}, volume = {15}, number = {4}, pages = {362--377}, publisher = {IEEE Computer Society Press}, earlier = {tran:adaptive}, URL = {http://csdl.computer.org/dl/trans/td/2004/04/l0362.pdf}, keywords = {pario-bib, access pattern, prefetching, modeling, time-series analysis}, abstract = {Inadequate I/O performance remains a major challenge in using high-end computing systems effectively. To address this problem, the paper presents TsModeler, an automatic time series modeling and prediction framework for adaptive I/O prefetching that uses ARIMA time series models to predict the temporal patterns of I/O requests. These online pattern analysis techniques and cutoff indicators for autocorrelation patterns enable multistep online predictions suitable for multiblock prefetching. This work also combines time series predictions with spatial Markov model predictions to determine when, what, and how many blocks to prefetch. Experimental results show reductions in execution time compared to the standard Linux file system across various hardware configurations.} } @Article{triantafillou:overlay, author = {Peter Triantafillou and Christos Faloutsos}, title = {Overlay striping and optimal parallel {I/O} for modern applications}, journal = {Parallel Computing}, year = {1998}, month = {January}, volume = {24}, number = {1}, pages = {21--43}, URL = {http://dx.doi.org/10.1016/S0167-8191(97)00115-4}, keywords = {parallel I/O, striping, pario-bib}, abstract = {Disk array systems are rapidly becoming the secondary-storage media of choice for many emerging applications with large storage and high bandwidth requirements. Striping data across the disks of a disk array introduces significant performance benefits mainly because the effective transfer rate of the secondary storage is increased by a factor equal to the stripe width. However, the choice of the optimal stripe width is an open problem: no general formal analysis has been reported and intuition alone fails to provide good guidelines. As a result one may find occasionally contradictory recommendations in the literature. With this work we first contribute an analytical calculation of the optimal stripe width. Second, we recognize that the optimal stripe width is sensitive to the multiprogramming level, which is not known a priori and fluctuates with time. Thus, calculations of the optimal stripe width are, by themselves only, of little practical use. For this reason we propose a novel striping technique, called overlay striping, which allows objects to be retrieved using a number of alternative stripe widths. 
We provide the detailed algorithms for our overlay striping method and we study the associated storage overhead and performance improvements and we show that we can achieve near optimal performance for very wide ranges of the possible multiprogramming levels, while incurring small storage overheads.}, comment = {Part of a special issue.} } @InProceedings{tridgell:hidios, author = {Andrew Tridgell and David Walsh}, title = {The {HiDIOS} filesystem}, booktitle = {Proceedings of the Fourth International Parallel Computing Workshop}, year = {1995}, month = {September}, pages = {53--63}, address = {London, England}, URL = {ftp://nimbus.anu.edu.au/pub/tridge/HiDIOS/hidios.ps.gz}, keywords = {parallel file system, pario-bib}, comment = {A description of their new parallel file system for the AP-1000. Conceptually, not much new here.} } @InProceedings{trieber:raid5, author = {Kent Treiber and Jai Menon}, title = {Simulation Study of Cached {RAID5} Designs}, booktitle = {Proceedings of the First Conference on High-Performance Computer Architecture}, year = {1995}, month = {January}, pages = {186--197}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, RAID, disk array, pario-bib}, abstract = {This paper considers the performance of cached RAID5 using simulations that are driven by database I/O traces collected at customer sites. This is in contrast to previous performance studies using analytical modelling or random-number simulations. We studied issues of cache size, disk buffering, cache replacement policies, cache allocation policies, destage policies and striping. Our results indicate that: read caching has considerable value; a small amount of cache should be used for writes; fast write logic can reduce disk utilization for writes by an order of magnitude; priority queueing should be supported at the disks; disk buffering prefetch should be used; for large caches, it pays to cache sequentially accessed blocks; RAID5 with cylinder striping is superior to parity striping.} } @Article{tsujita:mpi-io, author = {Yuichi Tsujita}, title = {Effective nonblocking {MPI-I/O} in remote {I/O} operations using a multithreaded mechanism}, journal = {Lecture Notes in Computer Science}, booktitle = {2nd International Symposium on Parallel and Distributed Processing and Applications; December 13-15, 2004; Hong Kong, PEOPLES R CHINA}, editor = {Cao, J; Yang, LT; Guo, M; Lau, F}, year = {2004}, month = {November}, volume = {3358}, pages = {34--43}, institution = {Kinki Univ, Fac Engn, Dept Elect Engn \& Comp Sci, Higashihiroshima, Hiroshima 7392116, Japan}, publisher = {SPRINGER-VERLAG BERLIN}, copyright = {(c)2005 The Thomson Corporation}, URL = {http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=3358&spage=34}, keywords = {stampi, MPI-I/O, dynamic process creation, multithreaded, overlap computation and I/O, pario-bib}, abstract = {A flexible intermediate library named Stampi realizes seamless MPI operations on interconnected parallel computers. Dynamic process creation and MPI-I/O operations both inside a computer and among computers are available with it. MPI-I/O operations to a remote computer are realized by MPI-I/O processes of the Stampi library which are invoked on a remote computer using a vendor-supplied MPI-I/O library. If the vendor-supplied one is not available, a single MPI-I/O process is invoked on a remote computer, and it uses UNIX I/O functions instead of the vendor-supplied one.
In nonblocking MPI-I/O functions with multiple user processes, the single MPI-I/O process carries out I/O operations required by the processes sequentially. This results in small overlap of computation by the user processes with I/O operations by the MPI-I/O process. Therefore performance of the nonblocking functions is poor with multiple user processes. To realize effective I/O operations, a Pthreads library has been implemented in the MPI-I/O mechanism, and multi-threaded I/O operations have been realized. The newly implemented MPI-I/O mechanism has been evaluated on inter-connected PC clusters, and higher overlap of the computation with the I/O operations has been achieved.}, comment = {also see tsujita:stampi*.} } @InProceedings{tsujita:stampi, author = {Yuichi Tsujita and Toshiyuki Imamura and Hiroshi Takemiya and Nobuhiro Yamagishi}, title = {{Stampi-I/O}: A Flexible Parallel-{I/O} Library for Heterogeneous Computing Environment}, booktitle = {Recent Advances in Parallel Virtual Machine and Message Passing Interface}, year = {2002}, series = {Lecture Notes in Computer Science}, volume = {2474}, pages = {288--?}, publisher = {Springer-Verlag}, URL = {http://link.springer.de/link/service/series/0558/bibs/2474/24740288.htm}, URLpdf = {http://link.springer.de/link/service/series/0558/papers/2474/24740288.pdf}, keywords = {parallel I/O, multiprocessor file system, pario-bib}, abstract = {An MPI-2 based parallel-I/O library, Stampi-I/O, has been developed using flexible communication infrastructure. In Stampi-I/O almost all MPI-I/O functions have been implemented. We can execute these functions using both local and remote I/O operations with the same application program interface (API) based on MPI-2. In I/O operations using Stampi-I/O, users need not handle any differences in the communication mechanism of computers. We have evaluated performance for primitive functions in Stampi-I/O. Through this test, sufficient performance has been achieved and effectiveness of our flexible implementation has been confirmed.} } @InProceedings{tsujita:stampi2, author = {Yuichi Tsujita}, title = {Implementation of an {MPI-I/O} mechanism using {PVFS} in remote {I/O} to a {PC} cluster}, booktitle = {Seventh International Conference on High Performance Computing and Grid in Asia Pacific Region}, year = {2004}, month = {July}, pages = {136--139}, organization = {Kinki University, Japan}, publisher = {Los Alamitos, CA, USA : IEEE Comput. Soc, 2004}, copyright = {(c)2005 IEE}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/hpcasia/2004/2138/00/21380136abs.htm}, keywords = {MPI-IO, PVFS, remote I/O, grid, pario-bib}, abstract = {A flexible intermediate library named Stampi realizes seamless MPI operations on a heterogeneous computing environment. With the help of a flexible communication mechanism of this library, users can execute MPI functions without awareness of underlying communication mechanism. Although Stampi supports MPI-I/O among different platforms, UNIX I/O functions are used when a vendor-supplied MPI-I/O library is not available. To realize distributed I/O operations, a parallel virtual file system (PVFS) has been implemented in the MPI-I/O mechanism. Primitive MPI-I/O functions of Stampi have been evaluated and sufficient performance has been achieved.}, comment = {also see tsujita:stampi.} } @InProceedings{uk:protein-folding, author = {B. Uk and M. Taufer and T. Stricker and G. Settanni and A. Cavalli and A.
Caflisch}, title = {Combining Task- and Data Parallelism to Speed up Protein Folding on a Desktop Grid Platform}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {240--249}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190240abs.htm}, keywords = {protein folding, grid application, parallel I/O, pario-app, pario-bib}, abstract = {The steady increase of computing power at lower and lower cost enables molecular dynamics simulations to investigate the process of protein folding with an explicit treatment of water molecules. Such simulations are typically done with well known computational chemistry codes like CHARMM. Desktop grids such as the United Devices MetaProcessor are highly attractive platforms, since scavenging for unused machines on Intra- and Internet delivers compute power that is almost free. However, the predominant programming paradigm for current desktop grids is pure task parallelism and might not fit the needs for protein folding simulations with explicit water molecules. A short overall turn-around time of a simulation remains highly important for research productivity, but the need for an accurate model and long simulation time-scales leads to tasks that are too large for optimal scheduling on a desktop grid. To address this problem, we introduce a combination of task- and data parallelism as a well suitable computing paradigm for protein folding investigations on grid platforms. As a proof of concept, we design and implement a simple system for protein folding simulations based on the notion of combined task and data parallelism with clustered workers. Clustered workers are machines grouped into small clusters according to network and CPU performance criteria and act as super-nodes within a desktop grid, permitting the utilization of data parallelism in addition to the task parallelism. We integrate our new paradigm into the existing software environment of the United Devices MetaProcessor. For a test protein, we reach a better quality of the folding calculations than we reached using just task parallelism on distributed systems.} } @InProceedings{uysal:mems, author = {Mustafa Uysal and Arif Merchant and Guillermo A. Alvarez}, title = {Using {MEMS}-based storage in disk arrays}, booktitle = {Proceedings of the USENIX FAST '03 Conference on File and Storage Technologies}, year = {2003}, month = {April}, pages = {89--101}, publisher = {USENIX Association}, address = {San Francisco, CA}, URL = {http://www.usenix.org/events/fast03/tech/uysal.html}, keywords = {mems-based storage, disk arrays, pario-bib}, abstract = {Current disk arrays, the basic building blocks of high-performance storage systems, are built around two memory technologies: magnetic disk drives, and non-volatile DRAM caches. Disk latencies are higher by six orders of magnitude than non-volatile DRAM access times, but cache costs over 1000 times more per byte. A new storage technology based on microelectromechanical systems (MEMS) will soon offer a new set of performance and cost characteristics that bridge the gap between disk drives and the caches. We evaluate potential gains in performance and cost by incorporating MEMS-based storage in disk arrays. Our evaluation is based on exploring potential placements of MEMS-based storage in a disk array. 
We used detailed disk array simulators to replay I/O traces of real applications for the evaluation. We show that replacing disks with MEMS-based storage can improve the array performance dramatically, with a cost performance ratio several times better than conventional arrays even if MEMS storage costs ten times as much as disk. We also demonstrate that hybrid MEMS/disk arrays, which cost less than purely MEMS-based arrays, can provide substantial improvements in performance and cost/performance over conventional arrays.}, comment = {Best paper in fast2003.} } @MastersThesis{vaitzblit:media, author = {Lev Vaitzblit}, title = {The Design and Implementation of a High-Bandwidth File Service for Continuous Media}, year = {1991}, month = {September}, school = {MIT}, keywords = {multimedia, distributed file system, disk striping, pario-bib}, comment = {A DFS for multimedia. Expect large files, read-mostly, highly sequential. Temporal synchronization is key. An administration server handles opens and closes, and provides guarantees on performance (like Swift). The interface at the client nodes talks to the admin server transparently, and stripes requests over all storage nodes. Storage nodes may internally use RAIDs, I suppose. Files are a series of frames, rather than bytes. Each frame has a time offset in seconds. Seeks can be by frame number or time offset. File containers contain several files, and have attributes that specify performance requirements. Interface does prefetching, based on read direction (forward or backward) and any frame skips. But frames are not transmitted from storage server to client node until requested (client pacing). Claim that synchronous disk interleaving with a striping unit of one frame is best. Could get 30 frames/sec (3.5MB/s) with 2 DECstation 5000s and 4 disks, serving a client DEC 5000.} } @InProceedings{vandegoor:unixio, author = {A. J. {van de Goor} and A. Moolenaar}, title = {{UNIX I/O} in a Multiprocessor System}, booktitle = {Proceedings of the 1988 Winter USENIX Conference}, year = {1988}, pages = {251--258}, keywords = {unix, multiprocessor file system, pario-bib}, comment = {How to split up the internals of the Unix I/O system to run on a shared-memory multiprocessor in a non-symmetric OS. They decided to split the functionality just above the buffer cache level, putting the buffer cache management and device drivers on the special I/O processors.} } @InProceedings{vanderleest:contention, author = {Steven VanderLeest and Ravishankar K. Iyer}, title = {Measurement of {I/O} Bus Contention and Correlation among Heterogeneous Device Types in a Single-bus Multiprocessor System}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, month = {April}, pages = {36--53}, organization = {Univ of Illinois, Urbana-Champaign}, note = {Also appeared in Computer Architecture News 22(4)}, later = {sinclair:instability-book}, keywords = {parallel I/O, pario-bib}, comment = {Using a hardware monitor they measure the I/O-bus usage on a 4-processor Sun workstation. They characterize the bus contention caused by multiple different devices (disk, screen, and network). The contention sometimes caused significant performance degradation (to the end-user) despite the bus not being overloaded.} } @InCollection{vanderleest:contention-book, author = {Steven H. VanderLeest and Ravishankar K. 
Iyer}, title = {Heterogeneous {I/O} Contention in a Single-bus Multiprocessor}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {14}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {313--331}, publisher = {Kluwer Academic Publishers}, earlier = {vanderleest:contention}, keywords = {parallel I/O, pario-bib}, abstract = {None.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @Article{varki:issues, author = {E. Varki and A. Merchant and J. Z. Xu and X. Z. Qiu}, title = {Issues and challenges in the performance analysis of real disk arrays}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2004}, month = {June}, volume = {15}, number = {6}, pages = {559 -- 574}, institution = {Univ New Hampshire, Dept Comp Sci, Nesmith Hall, Durham, NH 03824 USA; Univ New Hampshire, Dept Comp Sci, Durham, NH 03824 USA; Hewlett Packard Labs, Storage Syst Dept, Palo Alto, CA 94304 USA; Falconstor Software Inc, Melville, NY 11747 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, keywords = {performance analysis, disk arrays, performance modeling, pario-bib}, abstract = {The performance modeling and analysis of disk arrays is challenging due to the presence of multiple disks, large array caches, and sophisticated array controllers. Moreover, storage manufacturers may not reveal the internal algorithms implemented in their devices, so real disk arrays are effectively black-boxes. We use standard performance techniques to develop an integrated performance model that incorporates some of the complexities of real disk arrays. We show how measurement data and baseline performance models can be used to extract information about the various features implemented in a disk array. In this process, we identify areas for future research in the performance analysis of real disk arrays.} } @InCollection{varma:bdestage, author = {Anujan Varma and Quinn Jacobson}, title = {Destage Algorithms for Disk Arrays with Non-Volatile Caches}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {10}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {129--146}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {varma:destage}, URL = {http://www.buyya.com/superstorage/}, keywords = {parallel I/O, disk array, RAID, disk caching, pario-bib}, comment = {Part of jin:io-book; reformatted version of varma:destage.} } @Article{varma:destage, author = {Anujan Varma and Quinn Jacobson}, title = {Destage Algorithms for Disk Arrays with Non-Volatile Caches}, journal = {IEEE Transactions on Computers}, year = {1998}, month = {February}, volume = {47}, number = {2}, later = {varma:bdestage}, URL = {http://computer.org/tc/tc1998/t0228abs.htm}, keywords = {parallel I/O, disk array, RAID, disk caching, pario-bib}, abstract = {In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. 
We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches, such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.} } @Article{varman:bounds, author = {Peter J. Varman and Rakesh M. Verma}, title = {Tight Bounds for Prefetching and Buffer Management Algorithms for Parallel {I/O} Systems}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {1999}, month = {December}, volume = {10}, number = {12}, pages = {1262--1275}, URL = {http://ieeexplore.ieee.org/iel5/71/17782/00819948.pdf}, keywords = {parallel I/O, prefetching, pario-bib}, abstract = {The I/O performance of applications in multiple-disk systems can be improved by overlapping disk accesses. This requires the use of appropriate prefetching and buffer management algorithms that ensure the most useful blocks are accessed and retained in the buffer. In this paper, we answer several fundamental questions on prefetching and buffer management for distributed-buffer parallel I/O systems. First, we derive and prove the optimality of an algorithm, P-min, that minimizes the number of parallel I/Os. Second, we analyze P-con, an algorithm that always matches its replacement decisions with those of the well-known demand-paged MIN algorithm. We show that P-con can become fully sequential in the worst case. Third, we investigate the behavior of on-line algorithms for multiple-disk prefetching and buffer management. We define and analyze P-lru, a parallel version of the traditional LRU buffer management algorithm. Unexpectedly, we find that the competitive ratio of P-lru is independent of the number of disks. Finally, we present the practical performance of these algorithms on randomly generated reference strings. These results confirm the conclusions derived from the analysis on worst case inputs.} } @InProceedings{vellanki:predict, author = {Vivekanand Vellanki and Ann Chervenak}, title = {A Cost-Benefit Scheme for High Performance Predictive Prefetching}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/vellanki.pdf}, keywords = {file prefetching, cost-benefit analysis, parallel I/O, pario-bib}, comment = {They describe a prefetching scheme which prefetches blocks using a cost-benefit analysis scheme based on the probability that the block will be accessed. The benefit of prefetching a block is compared to the cost of replacing another block from the cache.
They were able to reduce cache miss rates by 36% for workloads which receive no benefit from sequential prefetching.} } @InProceedings{vengroff:TPIE, author = {Darren Erik Vengroff}, title = {A Transparent Parallel {I/O} Environment}, booktitle = {Proceedings of the 1994 DAGS/PC Symposium}, year = {1994}, month = {July}, pages = {117--134}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, URL = {ftp://ftp.cs.duke.edu/pub/dev/papers/tpie-dags94.ps.Z}, keywords = {parallel I/O, parallel I/O algorithms, pario-bib}, comment = {Interesting interface, providing high-level data-parallel access to vectors of data on disk. Implementation expectation is to use raw disk devices. Goals: abstraction, support for algorithmic optimality, flexible, portable, and extensible. TPIE is a set of C++ templates and libraries, where the user supplies callback functions to TPIE access methods. TPIE contains a small variety of access methods, each of which operates on a set of input and output streams, calling the user's function once for each set of input records. They can do scan, merge, distribution, sort, permute, batch filter, and distribution-sweep. There is a single thread of control (at least conceptually). Their first prototype is on a Sun SPARCstation; later, clusters of workstations and then a multiprocessor. See vengroff:efficient, vengroff:tpie-man.} } @InProceedings{vengroff:efficient, author = {Darren Erik Vengroff and Jeffrey Scott Vitter}, title = {Supporting {I/O}-Efficient Scientific Computation in {TPIE}}, booktitle = {Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing}, year = {1995}, month = {October}, pages = {74--77}, publisher = {IEEE Computer Society Press}, address = {San Antonio, TX}, earlier = {vengroff:efficient-tr}, later = {vengroff:efficient2}, keywords = {parallel I/O, algorithm, run-time library, pario-bib}, comment = {Shorter version of vengroff:efficient2. Excellent paper. This paper does not describe TPIE itself very much, but more about a set of benchmarks using TPIE. All of the benchmarks are run on one disk and one processor. TPIE can use multiple disks and one processor, with plans to extend it to multiple processors later. See vengroff:tpie and vengroff:efficient-tr. Same as vengroff:efficient2?} } @TechReport{vengroff:efficient-tr, author = {Darren Erik Vengroff and Jeffrey Scott Vitter}, title = {{I/O}-Efficient Scientific Computation Using {TPIE}}, year = {1995}, month = {July}, number = {CS--1995--18}, institution = {Dept. 
of Computer Science, Duke University}, later = {vengroff:efficient}, keywords = {parallel I/O algorithm, scientific computing, runtime library, pario-bib}, comment = {Expanded version of vengroff:efficient.} } @InProceedings{vengroff:efficient2, author = {Darren Erik Vengroff and Jeffrey Scott Vitter}, title = {{I/O}-Efficient Scientific Computation Using {TPIE}}, booktitle = {Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {1996}, month = {September}, pages = {II:553--570}, earlier = {vengroff:efficient}, keywords = {parallel I/O algorithms, run-time support, parallel I/O, multiprocessor file system interface, pario-bib}, comment = {Longer version of vengroff:efficient.} } @PhdThesis{vengroff:thesis, author = {Darren Erik Vengroff}, title = {The Theory and Practice of {I/O}-Efficient Computation}, year = {1996}, month = {April}, school = {Department of Computer Science, Brown University}, address = {Providence, RI}, keywords = {parallel I/O algorithm, pario-bib} } @Misc{vengroff:tpie-man, author = {Darren Erik Vengroff}, title = {{TPIE} User Manual and Reference}, year = {1995}, month = {January}, howpublished = {Available on the WWW at http://www.cs.duke.edu/\~{}dev/tpie_home_page.html}, note = {Alpha release}, URL = {http://www.cs.duke.edu/~dev/tpie_home_page.html}, keywords = {parallel I/O, parallel I/O algorithm, file system interface, pario-bib}, comment = {Currently an alpha version. It is in the process of being updated. The most current version is generally available on the WWW. See vengroff:tpie, vengroff:efficient.} } @InProceedings{venugopal:delays, author = {C.~R. Venugopal and S.~S.~S.~P. Rao}, title = {Impact of Delays in Parallel {I/O} System: An Empirical Study}, booktitle = {Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing}, year = {1996}, month = {August}, pages = {490-499}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O, pario-bib}, abstract = {Performance of I/O intensive applications on a multiprocessor system depends mostly on the variety of disk access delays encountered in the I/O system. Over the years, the improvement in disk performance has taken place more slowly than the corresponding increase in processor speeds. It is therefore necessary to model I/O delays and evaluate performance benefits of moving an application to a better multiprocessor system. We perform such an analysis by measuring I/O delays for a synthesized application that uses a parallel distributed file system. The aim of this study is to evaluate the performance benefits of better disks in a multiprocessor system. We report on how the I/O performance would be affected if an application were to run on a system which would have better disks and communication links. In this study, we show a substantial improvement in the performance of an I/O system with better disks and communication links with respect to the existing system.} } @Misc{vetsch:visiblehuman, author = {S. Vetsch and V. Messerli and O. Figueiredo and B. Gennart and R.D. Hersch and L. Bovisi and R. Welz and L. Bidaut and O. Ratib}, title = {Visible Human Slice Server}, year = {1998}, howpublished = {http://visiblehuman.epfl.ch/}, note = {A web site giving access to 2D views of a 3D scan of a human body.}, URL = {http://visiblehuman.epfl.ch/}, keywords = {image processing, parallel I/O, pario-bib}, abstract = {The computer scientists of EPFL (Prof. R.D. 
Hersch and his staff), in collaboration with the Geneva Hospitals and WDS Technologies SA, have developed a parallel image server to extract image slices of the Visible Human from any orientation. This 3D dataset originates from a prisoner sentenced to death who offered his body to science. The dead body was frozen and then cut and digitized into 1 mm horizontally spaced slices by the National Library of Medicine, Bethesda-Maryland and the University of Colorado, USA. The total volume of all slices represents a size of 13 Gbyte of data.}, comment = {Very cool. See also gennart:CAP, messerli:tomographic, messerli:jimage, messerli:thesis.} } @InProceedings{vilayannur:caching, author = {Murali Vilayannur and Anand Sivasubramaniam and Mahmut Kandemir and Rajeev Thakur and Robert Ross}, title = {Discretionary Caching for {I/O} on Clusters}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {96--103}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190096abs.htm}, keywords = {caching, parallel I/O, pario-bib}, abstract = {I/O bottlenecks are already a problem in many large-scale applications that manipulate huge datasets. This problem is expected to get worse as applications get larger, and the I/O subsystem performance lags behind processor and memory speed improvements. Caching I/O blocks is one effective way of alleviating disk latencies, and there can be multiple levels of caching on a cluster of workstations. Previous studies have shown the benefits of caching - whether it be local to a particular node, or a shared global cache across the cluster - for certain applications. However, we show that while caching is useful in some situations, it can hurt performance if we are not careful about what to cache and when to bypass the cache. This paper presents compilation techniques and runtime support to address this problem. These techniques are implemented and evaluated on an experimental Linux/Pentium cluster running a parallel file system. Our results using a diverse set of applications (scientific and commercial) demonstrate the benefits of a discretionary approach to caching for I/O subsystems on clusters, providing as much as 33% savings over indiscriminately caching everything in some applications.} } @InProceedings{vilayannur:posix-pvfs, author = {Murali Vilayannur and Robert B. Ross and Philip H. Carns and Rajeev Thakur and Anand Sivasubramaniam}, title = {On the performance of the POSIX I/O interface to PVFS}, booktitle = {12th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'04)}, year = {2004}, month = {February}, pages = {332 -- 339}, institution = {Penn State Univ, Dept Comp Sci \& Engn, University Pk, PA 16802 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, address = {Coruna, Spain}, URL = {http://csdl.computer.org/comp/proceedings/pdp/2004/2083/00/20830332abs.htm}, keywords = {posix I/O interface, performance, PVFS, parallel file system, pario-bib}, abstract = {The ever-increasing gap in performance between CPU/memory technologies and the I/O subsystem (disks, I/O buses) in modern workstations has exacerbated the I/O bottlenecks inherent in applications that access large disk resident data sets. 
A common technique to alleviate the I/O bottlenecks on clusters of workstations is the use of parallel file systems. One such parallel file system is the Parallel Virtual File System (PVFS), which is a freely available tool to achieve high-performance I/O on Linux-based clusters. In this paper we describe the performance and scalability of the UNIX I/O interface to PVFS. To illustrate the performance, we present experimental results using Bonnie++, a commonly used file system benchmark to test file system throughput; a synthetic parallel I/O application for calculating aggregate read and write bandwidths; and a synthetic benchmark which calculates the time taken to untar the Linux kernel source tree to measure performance of a large number of small file operations. We obtained aggregate read and write bandwidths as high as 550 MB/s with a Myrinet-based network and 160 MB/s with fast Ethernet.} } @InProceedings{vitter:jprefetch, author = {Jeffrey Scott Vitter and P. Krishnan}, title = {Optimal Prefetching via Data Compression}, booktitle = {Foundations of Computer Science}, year = {1991}, pages = {121--130}, earlier = {vitter:prefetch}, keywords = {prefetching, data compression, pario-bib} } @InProceedings{vitter:optimal, author = {Jeffrey Scott Vitter and Elizabeth A.~M. Shriver}, title = {Optimal Disk {I/O} with Parallel Block Transfer}, booktitle = {Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC~'90)}, year = {1990}, month = {May}, pages = {159--169}, keywords = {parallel I/O algorithms, parallel memory, pario-bib}, comment = {Summary of vitter:parmem1 and vitter:parmem2.} } @Article{vitter:parmem1, author = {Jeffrey Scott Vitter and Elizabeth A. M. Shriver}, title = {Algorithms for Parallel Memory {I}: Two-level Memories}, journal = {Algorithmica}, year = {1994}, month = {August and September}, volume = {12}, number = {2/3}, pages = {110--147}, earlier = {vitter:parmem1-tr}, keywords = {parallel I/O algorithms, parallel memory, pario-bib}, abstract = {We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of {\em parallel block transfer}, in which during a single~I/O each of the $P$ secondary storage devices can simultaneously transfer a contiguous block of $B$ records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with $P$ disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into $P$~separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than $\ell$ times the optimal number of I/Os is exponentially small in $\ell (\log \ell) \log (M/B)$, where $M$ is the internal memory size.}, comment = {See shorter version vitter:optimal. See TR version vitter:parmem1-tr. See also vitter:parmem2.} } @TechReport{vitter:parmem1-tr, author = {Jeffrey Scott Vitter and Elizabeth A. M.
Shriver}, title = {Algorithms for Parallel Memory {I}: Two-level Memories}, year = {1993}, month = {January}, number = {CS--93--01}, institution = {Dept. of Computer Science, Duke University}, note = {A summary appears in STOC '90. Revised version of Brown CS--92--04. Appeared in Algorithmica August 1994.}, later = {vitter:parmem1}, URL = {ftp://ftp.cs.duke.edu/pub/dist/techreport/1993/1993-01.ps.gz}, keywords = {parallel I/O algorithms, parallel memory, pario-bib}, comment = {Summarized in vitter:optimal. Published as vitter:parmem1.} } @Article{vitter:parmem2, author = {Jeffrey Scott Vitter and Elizabeth A. M. Shriver}, title = {Algorithms for Parallel Memory {II}: Hierarchical Multilevel Memories}, journal = {Algorithmica}, year = {1994}, month = {August and September}, volume = {12}, number = {2/3}, pages = {148--169}, earlier = {vitter:parmem2-tr}, keywords = {parallel I/O algorithms, parallel memory, pario-bib}, abstract = {In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there are $P$ memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using $\ell$ times the optimal running time is exponentially small in~$\ell (\log \ell) \log P$.}, comment = {Summarized in vitter:optimal.} } @TechReport{vitter:parmem2-tr, author = {Jeffrey Scott Vitter and Elizabeth A. M. Shriver}, title = {Algorithms for Parallel Memory {II}: Hierarchical Multilevel Memories}, year = {1993}, month = {January}, number = {CS--93--02}, institution = {Dept. of Computer Science, Duke University}, note = {A summary appears in STOC '90. Revised version of Brown CS--92--05. Appeared in Algorithmica 12(2,3).}, later = {vitter:parmem2}, URL = {ftp://ftp.cs.duke.edu/pub/dist/techreport/1993/1993-02.ps.gz}, keywords = {parallel I/O algorithms, parallel memory, pario-bib}, comment = {Summarized in vitter:optimal.} } @TechReport{vitter:prefetch, author = {Jeffrey Scott Vitter and P. Krishnan}, title = {Optimal Prefetching via Data Compression}, year = {1991}, month = {July}, number = {CS--91--46}, institution = {Brown University}, note = {A summary appears in FOCS '91}, later = {vitter:jprefetch}, URL = {ftp://ftp.cs.brown.edu/pub/techreports/91/cs91-46.ps.Z}, keywords = {parallel I/O algorithms, disk prefetching, pario-bib}, abstract = {Caching and prefetching are important mechanisms for speeding up access time to data on secondary storage. Recent work in competitive online algorithms has uncovered several promising new algorithms for caching. In this paper, we apply a form of the competitive philosophy for the first time to the problem of prefetching to develop an optimal universal prefetcher in terms of fault ratio, with particular applications to large-scale databases and hypertext systems. Our algorithms for prefetching are novel in that they are based on data compression techniques that are both theoretically optimal and good in practice. Intuitively, in order to compress data effectively, you have to be able to predict future data well, and thus good data compressors should be able to predict well for purposes of prefetching. 
We show for powerful models such as Markov sources and $m$th order Markov sources that the page fault rates incurred by our prefetching algorithms are optimal in the limit for almost all sequences of page accesses.}, comment = {``This... is on prefetching, but I think the ideas will have a lot of use with parallel disks. The implementations we have now are doing amazingly well compared to LRU.'' [Vitter]. See vitter:jprefetch.} } @InProceedings{vitter:summary, author = {Jeffrey Scott Vitter}, title = {Efficient Memory Access in Large-Scale Computation}, booktitle = {Proceedings of the 1991 Symposium on Theoretical Aspects of Computer Science (STACS~'91)}, year = {1991}, series = {Lecture Notes in Computer Science}, volume = {480}, pages = {26--41}, publisher = {Springer-Verlag}, copyright = {Springer Verlag}, address = {Berlin}, keywords = {parallel I/O algorithms, sorting, pario-bib}, comment = {Good overview of all the other papers.} } @InCollection{vitter:survey, author = {Jeffrey Scott Vitter}, title = {External Memory Algorithms and Data Structures: dealing with massive data}, booktitle = {External Memory Algorithms and Visualization}, editor = {James Abello and Jeffrey Scott Vitter}, crossref = {abello:dimacs}, year = {1999}, series = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science}, pages = {1--38}, publisher = {American Mathematical Society Press}, address = {Providence, RI}, URL = {http://www.cs.duke.edu/~jsv/Papers/Vit98.IO_survey.ps.gz}, keywords = {out-of-core algorithm, pario-bib}, comment = {Earlier shorter versions entitled "External Memory Algorithms" appear as an invited tutorial in Proceedings of the 17th ACM Symposium on Principles of Database Systems, Seattle, WA, June 1998, 119--128, and as an invited paper in Proceedings of the 6th Annual European Symposium on Algorithms, Venice, Italy, August 1998, 1--25, published in Lecture Notes in Computer Science, 1461, Springer-Verlag, Berlin} } @Article{vitter:uniform, author = {Jeffrey Scott Vitter and Mark H. Nodine}, title = {Large-Scale Sorting in Uniform Memory Hierarchies}, journal = {Journal of Parallel and Distributed Computing}, year = {1993}, month = {January and February}, volume = {17}, number = {1--2}, pages = {107--114}, publisher = {Academic Press}, keywords = {parallel I/O algorithm, sorting, pario-bib}, comment = {Summary is nodine:sort.} } @Manual{vms:stripe, key = {DEC}, title = {{VAX} Disk Striping Driver for {VMS}}, year = {1989}, month = {December}, organization = {Digital Equipment Corporation}, note = {Order Number AA-NY13A-TE}, keywords = {disk striping, pario-bib}, comment = {Describes the VAX disk striping driver. Stripes an apparently arbitrary number of disk devices. All devices must be the same type, and apparently completely used. Manager can specify ``chunksize'', the number of logical blocks per striped block. They suggest using the track size of the device as the chunk size. They also point out that multiple controllers should be used in order to gain parallelism.} } @InProceedings{voelker:coop, author = {Geoffrey M. Voelker and Eric J. Anderson and Tracy Kimbrel and Michael J. Feeley and Jeffrey S. Chase and Anna R. Karlin and Henry M. 
Levy}, title = {Implementing Cooperative Prefetching and Caching in a Globally-Managed Memory System}, booktitle = {Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems}, year = {1998}, month = {June}, pages = {33--43}, publisher = {ACM Press}, URL = {http://www.acm.org/pubs/citations/proceedings/metrics/277851/p33-voelker/}, keywords = {distributed shared memory, cooperative caching, parallel I/O, pario-bib} } @InProceedings{vydyanathan:pipeline, author = {Naga Vydyanathan and Gaurav Khanna and Tahsin M. Kurc and Umit V. Catalyurek and Pete Wyckoff and Joel H. Saltz and P. Sadayappan}, title = {Use of {PVFS} for efficient execution of jobs with pipeline-shared {I/O}}, booktitle = {Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing}, editor = {Buyya, R}, year = {2004}, month = {November}, pages = {235--242}, institution = {Ohio State Univ, Dept Comp Sci \& Engn, Columbus, OH 43210 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2005 The Thomson Corporation}, address = {Pittsburgh, PA}, URL = {http://csdl.computer.org/comp/proceedings/grid/2004/2256/00/22560235abs.htm}, keywords = {PVFS, pipelined-shared I/O, grid computing, pario-bib}, abstract = {This paper is concerned with efficient execution of applications that are composed of a chain of sequential data processes, which exchange data through a file system. We focus on pipeline-shared I/O behavior within a single pipeline of processes running on a cluster. We examine several scheduling strategies and experimentally evaluate them for efficient use of the Parallel Virtual File System (PVFS) as a common storage pool.} } @InProceedings{waltz:database, author = {David L. Waltz}, title = {Innovative Massively Parallel {AI} Applications}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {132--138}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, keywords = {database, AI, artificial intelligence, pario-bib}, abstract = {Massively parallel applications must address problems that will be too large for workstations for the next several years, or else it will not make sense to expend development costs on them. Suitable applications include one or more of the following properties: 1) large amounts of data; 2) intensive computations; 3) requirement for very fast response times; 4) ways to trade computations for human effort, as in developing applications using learning methods. Most of the suitable applications that we have found come from the general area of very large databases. Massively parallel machines have proved to be important not only in being able to run large applications, but in accelerating development (allowing the use of simpler algorithms, cutting the time to test performance on realistic databases) and allowing many different algorithms and parameter settings to be tried and compared for a particular task. This paper summarizes four such applications. \par The applications described are: 1) prediction of credit card "defaulters" (non-payers) and "attritters" (people who didn't renew their cards) from a credit card database; 2) prediction of the continuation of time series, e.g. stock price movements; 3) automatic keyword assignment for news articles; and 4) protein secondary structure prediction.
These add to a list identified in an earlier paper [Waltz 90] including: 5) automatic classification of U.S. Census Bureau long forms, using MBR -- Memory-Based Reasoning [Creecy et al 92, Waltz 89, Stanfill \& Waltz 86]; 6) generating catalogs for a mail order company that maximize expected net returns (revenues from orders minus cost of the catalogs and mailings) using genetically-inspired methods; and 7) text-based intelligent systems for information retrieval, decision support, etc.}, comment = {Invited speaker.} } @TechReport{wang:paging, author = {Kuei Yu Wang and Dan C. Marinescu}, title = {An Analysis of the Paging Activity of Parallel Programs, Part {I}: Correlation of the Paging Activity of Individual Node Programs in the {SPMD} Execution Mode}, year = {1994}, month = {June}, number = {CSD-TR-94-042}, institution = {Purdue University}, keywords = {parallel I/O, virtual memory, paging, characterization, pario-bib}, comment = {They measured the paging behavior of programs running on a Paragon, and analyze the results. To do so, they sample the OSF paging statistics periodically. The general conclusions: they found a surprising amount of dissimilarity in the paging behaviors of nodes within the same program, both in terms of the amount of paging and the timing of peak paging activity. These characteristics do not bode well for systems that use gang scheduling, or applications that have a lot of barriers.} } @InProceedings{wang:workload, author = {Feng Wang and Qin Xin and Bo Hong and Scott A. Brandt and Ethan L. Miller and Darrell D. E. Long and Tyce T. McLarty}, title = {File system workload analysis for large scale scientific computing applications}, booktitle = {Proceedings of the Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies}, year = {2004}, month = {April}, publisher = {IEEE Computer Society Press}, address = {College Park, MD}, URL = {http://ssrc.cse.ucsc.edu/Papers/wang-mss04.pdf}, keywords = {file system workload, workload characterization, ASCI, lustre, scientific applications, pario-app, pario-bib}, abstract = {Parallel scientific applications require high-performance I/O support from underlying file systems. A comprehensive understanding of the expected workload is therefore essential for the design of high-performance parallel file systems. We re-examine the workload characteristics in parallel computing environments in the light of recent technology advances and new applications. \par We analyze application traces from a cluster with hundreds of nodes. On average, each application has only one or two typical request sizes. Large requests from several hundred kilobytes to several megabytes are very common. Although in some applications, small requests account for more than 90% of all requests, almost all of the I/O data are transferred by large requests. All of these applications show bursty access patterns. More than 65% of write requests have inter-arrival times within one millisecond in most applications. By running the same benchmark on different file models, we also find that the write throughput of using an individual output file for each node exceeds that of using a shared file for all nodes by a factor of 5. This indicates that current file systems are not well optimized for file sharing.}, comment = {An I/O workload study of three applications on a 960 node (dual-processors) cluster at LLNL running the lustre-light parallel file system. 
The applications include an I/O benchmarking code (ior2) and two physics simulations: one that ran on 343 processors and one that ran on 1620 processors.} } @InProceedings{watson:hpss, author = {Richard W. Watson and Robert A. Coyne}, title = {The Parallel {I/O} Architecture of the High-Performance Storage System ({HPSS})}, booktitle = {Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems}, year = {1995}, month = {September}, pages = {27--44}, publisher = {IEEE Computer Society Press}, URL = {http://www.computer.org/conferen/mss95/watson/watson.htm}, keywords = {mass storage, parallel I/O, multiprocessor file system interface, pario-bib}, abstract = {Datasets up to terabyte size and petabyte total capacities have created a serious imbalance between I/O and storage-system performance and system functionality. One promising approach is the use of parallel data-transfer techniques for client access to storage, peripheral-to-peripheral transfers, and remote file transfers. This paper describes the parallel I/O architecture and mechanisms, parallel transport protocol (PTP), parallel FTP, and parallel client application programming interface (API) used by the high-performance storage system (HPSS). Parallel storage integration issues with a local parallel file system are also discussed.} } @InProceedings{weissman:smart, author = {Jon B. Weissman}, title = {Smart File Objects: A Remote File Access Paradigm}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {89--97}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Weissman.ps}, keywords = {object, parallel I/O, pario-bib}, abstract = {This paper describes a new scheme for remote file access called Smart File Objects (SFO). The SFO is an object-oriented application-specific file access paradigm designed to attack the bottleneck imposed by high latency, low bandwidth networks such as wide-area and wireless networks. The SFO uses application and network information to adaptively prefetch needed data in parallel with the execution of the application. The SFO can offer additional advantages such as non-blocking I/O, bulk I/O, improved file access APIs, and increased reliability. We describe the SFO concept, a prototype implementation in the Mentat system, and the results obtained with a distributed gene sequence application running across the Internet and vBNS. The results show the potential of the SFO approach to improve application performance.} } @InProceedings{wickremesinghe:active-storage, author = {Rajiv Wickremesinghe and Jeffrey S. Chase and Jeffrey S. Vitter}, title = {Distributed Computing with Load-Managed Active Storage}, booktitle = {Proceedings of the Eleventh IEEE International Symposium on High Performance Distributed Computing}, year = {2002}, pages = {24--34}, publisher = {IEEE Computer Society Press}, address = {Edinburgh, Scotland}, keywords = {I/O, active storage, TPIE, grid, parallel I/O, pario-bib}, comment = {Very interesting talk... an extension of the TPIE work. They assign a mapping of computations to storage-based processors. This stuff is very similar to Armada. They place "functors" that have bounded per-record processing and bounded internal state at the ASU (active storage unit). Since functors have bounded computation and state, they have predictive behavior (used for load balancing and scheduling).
The extensions to TPIE include data aggregation primitives for sets (unordered data), streams (sequential data), and arrays (random access data). They also allow functors to process "packets" (groups of records) useful for applications like a merge sort. The example applications include the standard TPIE GIS app, along with a merge sort.} } @InProceedings{wiebalck:enbd, author = {Arne Wiebalck and Peter T. Breuer and Volker Lindenstruth and Timm M. Steinbeck}, title = {Fault-Tolerant Distributed Mass Storage for LHC Computing}, booktitle = {Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid}, year = {2003}, month = {May}, pages = {266--275}, publisher = {IEEE Computer Society Press}, address = {Tokyo, Japan}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190266abs.htm}, keywords = {RAID, fault-tolerance, high-energy physics, parallel I/O, pario-app, pario-bib}, abstract = {In this paper we present the concept and first prototyping results of a modular fault-tolerant distributed mass storage architecture for large Linux PC clusters as they are deployed by the upcoming particle physics experiments. The device masquerading technique using an Enhanced Network Block Device (ENBD) enables local RAID over remote disks as the key concept of the ClusterRAID system. The block level interface to remote files, partitions or disks provided by the ENBD makes it possible to use the standard Linux software RAID to add fault-tolerance to the system. Preliminary performance measurements indicate that the latency is comparable to a local hard drive. With four disks throughput rates of up to 55MB/s were achieved with first prototypes for a RAID0 setup, and about 40MB/s for a RAID5 setup.} } @Article{wilkes:autoraid, author = {John Wilkes and Richard Golding and Carl Staelin and Tim Sullivan}, title = {The {HP AutoRAID} Hierarchical Storage System}, journal = {ACM Transactions on Computer Systems}, year = {1996}, month = {February}, volume = {14}, number = {1}, pages = {108--136}, publisher = {ACM Press}, earlier = {wilkes:autoraid-sosp}, later = {wilkes:bautoraid}, URL = {http://www.acm.org/pubs/citations/journals/tocs/1996-14-1/p108-wilkes/}, keywords = {RAID, disk array, parallel I/O, pario-bib} } @InProceedings{wilkes:autoraid-sosp, author = {John Wilkes and Richard Golding and Carl Staelin and Tim Sullivan}, title = {The {HP AutoRAID} Hierarchical Storage System}, booktitle = {Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles}, year = {1995}, month = {December}, pages = {96--108}, publisher = {ACM Press}, address = {Copper Mountain, CO}, later = {wilkes:autoraid}, URL = {http://www.hpl.hp.com/personal/John_Wilkes/papers/AutoRAID.SOSP95.ps.Z}, keywords = {RAID, disk array, parallel I/O, pario-bib}, comment = {Cite wilkes:autoraid. A commercial RAID box that transparently manages a hierarchy of two RAID systems, a RAID-1 mirrored system and a RAID-5 system. The goal is easy-to-use high performance, and they appear to have achieved that goal. Data in current use are kept in the RAID-1, and other data in RAID-5. This design gives performance of RAID-1 with cost of RAID-5. They have a clever scheme for spreading both RAIDs across most disks, including a hot spare. Dual controllers, power supplies, fans, etc. The design is a fairly standard RAID hardware controller, using standard SCSI disks, but with all the new tricks done in controller software. 
The paper gives a few results from the prototype hardware, and a lot of simulation results.} } @InCollection{wilkes:bautoraid, author = {John Wilkes and Richard Golding and Carl Staelin and Tim Sullivan}, title = {The {HP AutoRAID} Hierarchical Storage System}, booktitle = {High Performance Mass Storage and Parallel {I/O}: Technologies and Applications}, chapter = {7}, editor = {Hai Jin and Toni Cortes and Rajkumar Buyya}, year = {2001}, pages = {90--106}, publisher = {IEEE Computer Society Press and Wiley}, address = {New York, NY}, earlier = {wilkes:autoraid}, URL = {http://www.buyya.com/superstorage/}, keywords = {RAID, disk array, parallel I/O, pario-bib}, comment = {Part of jin:io-book; reformatted version of wilkes:autoraid.} } @TechReport{wilkes:datamesh, author = {John Wilkes}, title = {{DataMesh}--- scope and objectives: a commentary}, year = {1989}, month = {July}, number = {HP-DSD-89-44}, institution = {Hewlett-Packard}, later = {wilkes:datamesh1}, keywords = {parallel I/O, distributed file system, disk caching, heterogeneous file system, pario-bib}, comment = {Hooks a heterogeneous set of storage devices together over a fast interconnect, each with its own identical processor. The whole would then act as a file server for a network. Data storage devices would range from fast to slow (e.g. optical jukebox), varying availability, etc.. Many ideas here but few concrete suggestions. Very little mention of algorithms they might use to control the thing. See also wilkes:datamesh1, cao:tickertaip, chao:datamesh, wilkes:houses, wilkes:lessons.} } @InProceedings{wilkes:datamesh1, author = {John Wilkes}, title = {{DataMesh} Research Project, Phase 1}, booktitle = {Proceedings of the USENIX File Systems Workshop}, year = {1992}, month = {May}, pages = {63--69}, earlier = {wilkes:datamesh}, keywords = {distributed file system, parallel I/O, disk scheduling, disk layout, pario-bib}, comment = {See chao:datamesh} } @InProceedings{wilkes:datamesh2, author = {John Wilkes}, title = {The {DataMesh} research project}, booktitle = {Transputing '91}, editor = {P. Welch et al.}, year = {1991}, pages = {547--533}, publisher = {IOS Press}, keywords = {parallel I/O, RAID, disk striping, pario-bib}, comment = {An overview report on the DataMesh project. It adds a little to the earlier reports such as wilkes:datamesh1. It has some performance results from simulation.} } @Article{wilkes:houses, author = {John Wilkes}, title = {{DataMesh}, house-building, and distributed systems technology}, journal = {ACM Operating Systems Review}, year = {1993}, month = {April}, volume = {27}, number = {2}, pages = {104--108}, earlier = {wilkes:datamesh1}, later = {wilkes:lessons}, keywords = {file system, distributed computing, pario-bib}, comment = {Same as wilkes:lessons. See that for comments.} } @InProceedings{wilkes:lessons, author = {John Wilkes}, title = {{DataMesh}, house-building, and distributed systems technology}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {1--5}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, earlier = {wilkes:houses}, keywords = {file system, parallel I/O, RAID, disk array, pario-bib}, comment = {Invited speaker. Also appeared in ACM OSR April 1993 (wilkes:houses). This gives his viewpoint that we should be focusing more on architecture than on components, to design frameworks rather than just individual policies and mechanisms. It also gives a quick overview of DataMesh. 
For more DataMesh info, though, see cao:tickertaip, chao:datamesh, wilkes:datamesh1, wilkes:datamesh, wilkes:houses.} } @InProceedings{willeman:pario, author = {Ray Willeman and Susan Phillips and Ron Fargason}, title = {An Integrated Library For Parallel Processing: The Input/Output Component}, booktitle = {Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications}, year = {1989}, pages = {573--575}, publisher = {Golden Gate Enterprises, Los Altos, CA}, address = {Monterey, CA}, keywords = {parallel I/O, pario-bib}, comment = {Like the CUBIX interface, in some ways. Meant for parallel access to non-striped (sequential) file. Self-describing format so that the reader can read the formatting information and distribute data accordingly.} } @InProceedings{wisniewski:in-place, author = {Leonard F. Wisniewski}, title = {Structured Permuting in Place on Parallel Disk Systems}, booktitle = {Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1996}, month = {May}, pages = {128--139}, publisher = {ACM Press}, address = {Philadelphia}, keywords = {parallel I/O, parallel I/O algorithm, permutation, out-of-core, pario-bib}, abstract = {The ability to perform permutations of large data sets in place reduces the amount of necessary available disk storage. The simplest way to perform a permutation often is to read the records of a data set from a source portion of data storage, permute them in memory, and write them to a separate target portion of the same size. It can be quite expensive, however, to provide disk storage that is twice the size of very large data sets. Permuting in place reduces the expense by using only a small amount of extra disk storage beyond the size of the data set. \par \newcommand{\ceil}[1]{\lceil #1\rceil} \newcommand{\rank}[1]{\mathop{\rm rank}\nolimits #1} This paper features in-place algorithms for commonly used structured permutations. We have developed an asymptotically optimal algorithm for performing BMMC (bit-matrix-multiply/complement) permutations in place that requires at most $\frac{2N}{BD}\left( 2\ceil{\frac{\rank{\gamma}}{\lg (M/B)}} + \frac{7}{2}\right)$ parallel disk accesses, as long as $M \geq 2BD$, where $N$ is the number of records in the data set, $M$ is the number of records that can fit in memory, $D$ is the number of disks, $B$ is the number of records in a block, and $\gamma$ is the lower left $\lg (N/B) \times \lg B$ submatrix of the characteristic matrix for the permutation. This algorithm uses $N+M$ records of disk storage and requires only a constant factor more parallel disk accesses and insignificant additional computation than a previously published asymptotically optimal algorithm that uses $2N$ records of disk storage. \par We also give algorithms to perform mesh and torus permutations on a $d$-dimensional mesh. The in-place algorithm for mesh permutations requires at most $3\ceil{N/BD}$ parallel I/Os and the in-place algorithm for torus permutations uses at most $4dN/BD$ parallel I/Os. The algorithms for mesh and torus permutations require no extra disk space as long as the memory size $M$ is at least $3BD$. 
The torus algorithm improves upon the previous best algorithm in terms of both time and space.} } @InProceedings{wisniewski:mpiio, author = {Len Wisniewski and Brad Smisloff and Nils Nieuwejaar}, title = {{Sun MPI I/O}: Efficient {I/O} for Parallel Applications}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, publisher = {ACM Press and IEEE Computer Society Press}, address = {Portland, OR}, URL = {http://www.sc99.org/proceedings/papers/wisniew.pdf}, keywords = {MPI I/O, parallel file system, pario-bib}, comment = {They describe the port of MPI I/O to the Sun Parallel File system (a direct descendent of Galley).} } @Misc{wisniewski:sun-mpi-io, author = {Len Wisniewski and Brad Smisloff and Nils Nieuwejaar}, title = {{Sun MPI I/O}: Efficient {I/O} for Parallel Applications}, booktitle = {Proceedings of SC99: High Performance Networking and Computing}, year = {1999}, month = {November}, pages = {19}, publisher = {ACM Press}, URL = {http://www.sc99.org/proceedings/papers/wisniew.pdf}, keywords = {MPI I/O, parallel I/O, multiprocessor file system interface, pario-bib} } @InProceedings{witkowski:hyper-fs, author = {Andrew Witkowski and Kumar Chandrakumar and Greg Macchio}, title = {Concurrent {I/O} System for the Hypercube Multiprocessor}, booktitle = {Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications}, year = {1988}, pages = {1398--1407}, publisher = {ACM Press}, address = {Pasadena, CA}, keywords = {parallel I/O, hypercube, parallel file system, pario-bib}, comment = {Concrete system for the hypercube. Files resident on one disk only. Little support for cooperation except for sequentialized access to parts of the file, or broadcast. No mention of random-access files. I/O nodes are distinguished from computation nodes. I/O nodes have separate comm. network. No parallel access. I/O hooked to front-end too.} } @InProceedings{wolf:dasd, author = {Joel L. Wolf and Philip S. Yu and Hadas Shachnai}, title = {{DASD} Dancing: A Disk Load Balancing Optimization Scheme for Video-on-Demand Computer Systems}, booktitle = {Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1995}, month = {May}, pages = {157--166}, keywords = {parallel I/O, video server, multimedia, pario-bib} } @Article{wolman:iobench, author = {Barry L. Wolman and Thomas M. Olson}, title = {{IOBENCH:} A System Independent {IO} Benchmark}, journal = {Computer Architecture News}, year = {1989}, month = {September}, volume = {17}, number = {5}, pages = {55--70}, keywords = {I/O benchmark, transaction processing, pario-bib}, comment = {Not about parallel I/O, but see olson:random. Defines a new I/O benchmark that is fairly system-independent. Focus is for transaction processing systems. Cranks up many tasks (users) all doing repetitive read/writes for a specified time, using optional locking, and optional computation. Whole suite of results for comparison with others. See also chen:iobench.} } @Article{womble:intro, author = {David E. Womble and David S. Greenberg}, title = {Parallel {I/O}: An introduction}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {403--417}, publisher = {North-Holland (Elsevier Scientific)}, keywords = {parallel I/O, pario-bib}, comment = {A brief introduction to the topic of parallel I/O (what, why, current research), followed by a roundtable discussion among the authors of the papers in womble:special-issue. 
The discussion focused on three questions: 1) What are the biggest gaps in current I/O services? 2) Why have vendors failed to adopt new file system technologies? 3) How much direct low-level control over I/O resources should be given to the users and why?} } @InProceedings{womble:outofcore, author = {David Womble and David Greenberg and Rolf Riesen and Stephen Wheat}, title = {Out of Core, Out of Mind: Practical Parallel {I/O}}, booktitle = {Proceedings of the Scalable Parallel Libraries Conference}, year = {1993}, month = {October}, pages = {10--16}, address = {Mississippi State University}, URL = {ftp://ftp.cs.sandia.gov/pub/papers/dewombl/parallel_io_scl93.ps.Z}, keywords = {parallel I/O, parallel file system, pario-bib}, abstract = {Parallel computers are becoming more powerful and more complex in response to the demand for computing power by scientists and engineers. Inevitably, new and more complex I/O systems will be developed for these systems. In particular we believe that the I/O system must provide the programmer with the ability to explicitly manage storage (despite the trend toward complex parallel file systems and caching schemes). One method of doing so is to have a partitioned secondary storage in which each processor owns a logical disk. Along with operating system enhancements which allow overheads such as buffer copying to be avoided and libraries to support optimal remapping of data, this sort of I/O system meets the needs of high performance computing.}, comment = {They argue that it is important to allow the programmer to explicitly control their storage in some way. In particular, they advocate the Partitioned Secondary Storage (PSS) model, in which each processor has its own logical disk, rather than using a parallel file system (PFS) which automatically stripes a linear file across many disks. Basically, programmer knows best. Of course, libraries can help. They note that you will often need data in a different format than it comes, and may need it output in a different format; so, permutation algorithms are needed. Also important to be able to overlap computation with I/O. They use LU factorization as an example, and give an algorithm. On the nCUBE with the PUMA operating system, they get good performance. See womble:pario.} } @InProceedings{womble:pario, author = {David Womble and David Greenberg and Stephen Wheat and Rolf Riesen}, title = {Beyond Core: Making Parallel Computer {I/O} Practical}, booktitle = {Proceedings of the 1993 DAGS/PC Symposium}, year = {1993}, month = {June}, pages = {56--63}, organization = {Dartmouth Institute for Advanced Graduate Studies}, address = {Hanover, NH}, URL = {ftp://ftp.cs.sandia.gov/pub/papers/dewombl/parallel_io_dags93.ps.Z}, keywords = {parallel I/O, out-of-core, parallel algorithm, scientific computing, multiprocessor file system, pario-bib}, abstract = {The solution of Grand Challenge Problems will require computations which are too large to fit in the memories of even the largest machines. Inevitably, new designs of I/O systems will be necessary to support them. Through our implementations of an out-of-core LU factorization we have learned several important lessons about what I/O systems should be like. In particular we believe that the I/O system must provide the programmer with the ability to explicitly manage storage. One method of doing so is to have a partitioned secondary storage in which each processor owns a logical disk. 
Along with operating system enhancements which allow overheads such as buffer copying to be avoided, this sort of I/O system meets the needs of high performance computing.}, comment = {See womble:outofcore. See thakur:runtime, kotz:lu, brunet:factor for other out-of-core LU results.} } @Article{womble:special-issue, author = {David E. Womble and David S. Greenberg}, title = {Parallel {I/O}}, journal = {Parallel Computing}, year = {1997}, month = {June}, volume = {23}, number = {4}, pages = {iii}, publisher = {North-Holland (Elsevier Scientific)}, note = {Introduction to a special issue.}, keywords = {parallel I/O, pario-bib}, comment = {A one-page introduction to this special issue of Parallel Computing, which includes many papers about parallel I/O. See also womble:intro, nieuwejaar:jgalley, moore:ddio, barve:jmergesort, miller:jrama, schwabe:jlayouts, parsons:templates, cormen:early-vic, carretero:performance,} } @TechReport{wong:benchmarks, author = {Parkson Wong and Rob F Van der Wijngaart}, title = {{NAS} Parallel Benchmarks {I/O} Version 2.4}, year = {2003}, month = {January}, number = {NAS-03-002}, institution = {Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division}, address = {NASA Ames Research Center, Moffett Field, CA 94035-1000}, URL = {http://www.nas.nasa.gov/News/Techreports/2003/PDF/nas-03-002.pdf}, keywords = {parallel I/O benchmarks, block tridiagonal, pario-app, pario-bib}, abstract = {We describe a benchmark problem, based on the Block-Tridiagonal (BT) problem of the NAS Parallel Benchmarks (NPB), which is used to test the output capabilities of high-performance computing systems, especially parallel systems. We also present a source code implementation of the benchmark, called NPBIO2.4-MPI, based on the MPI implementation of the NPB, using a variety of ways to write the computed solutions to file.} } @Article{woodward:scivi, author = {Paul R. Woodward}, title = {Interactive Scientific Visualization of Fluid Flow}, journal = {IEEE Computer}, year = {1993}, month = {October}, volume = {26}, number = {10}, pages = {13--25}, publisher = {IEEE Computer Society Press}, keywords = {parallel I/O architecture, scientific visualization, pario-bib}, comment = {This paper is interesting for its impressive usage of RAIDs and parallel networks to support scientific visualization. In particular, the proposed Gigawall (a 10-foot by 6-foot gigapixel-per-second display) is run by 24 SGI processors and 32 9-disk RAIDs, connected to an MPP of some kind through an ATM switch. 
512 GBytes of storage, playable at 450 MBytes per second, for 19 minutes of animation.} } @Article{worringen:improving, author = {Joachim Worringen and Jesper Larsson Tr\"{a}ff and Hubert Ritzdorf}, title = {Improving generic non-contiguous file access for {MPI-IO}}, journal = {Lecture Notes in Computer Science}, booktitle = {10th European Parallel Virtual Machine and Message Passing Interface Users Group Meeting (PVM/MPI); September 29 - October 2, 2003; VENICE, ITALY}, editor = {Dongarra, J; Laforenza, D; Orlando, S}, year = {2003}, month = {October}, volume = {2840}, pages = {309--318}, institution = {NEC Europe Ltd, C\&C Res Labs, Rathausallee 10, D-53757 St Augustin, Germany; NEC Europe Ltd, C\&C Res Labs, D-53757 St Augustin, Germany}, publisher = {Springer-Verlag Heidelberg}, copyright = {(c)2005 The Thomson Corporation}, URL = {http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2840&spage=309}, keywords = {access patterns, MPI-IO, listless I/O, pario-bib}, abstract = {We present a fundamental improvement of the generic techniques for non-contiguous file access in MPI-IO. The improvement consists in the replacement of the conventional data management algorithms based on a representation of the non-contiguous fileview as a list of (offset, length) tuples. The improvement is termed listless i/o as it instead makes use of space- and time-efficient datatype handling functionality that is completely free of lists for processing non-contiguous data in the file or in memory. Listless i/o has been implemented for both independent and collective file accesses and improves access performance by increasing the data throughput between user buffers and file buffers. Additionally, it reduces the memory footprint of the process performing non-contiguous I/O. In this paper we give results for a synthetic benchmark on a PC cluster using different file systems. We demonstrate improvements in I/O bandwidth that exceed a factor of 10.}, comment = {Also see worringen:non-contiguous} } @InProceedings{worringen:non-contiguous, author = {Joachim Worringen and Jesper Larson Tr\"{a}ff and Hubert Ritzdorf}, title = {Fast Parallel Non-Contiguous File Access}, booktitle = {Proceedings of SC2003: High Performance Networking and Computing}, year = {2003}, month = {November}, publisher = {IEEE Computer Society Press}, address = {Phoenix, AZ}, URL = {http://www.sc-conference.org/sc2003/paperpdfs/pap319.pdf}, keywords = {parallel I/O interface, file access patterns, pario-bib}, abstract = {Many applications of parallel I/O perform non-contiguous file accesses, but only few file system interfaces support non-contiguous access. In contrast, the most commonly used parallel programming interface, MPI, supports parallel I/O through its MPI-IO interface. Within this interface, non-contiguous accesses are supported by the use of derived MPI datatypes. Unfortunately, current MPI-IO implementations suffer from low performance of such non-contiguous accesses when compared to the performance of the storage system for contiguous accesses although a considerable amount of work has been done in this area. In this paper we analyze an important bottleneck in current implementations of MPI-IO, and present a new technique termed listless i/o to perform non-contiguous access with MPI-IO. 
On the NEC SX-series of parallel vector computers, listless i/o is able to increase the bandwidth for non-contiguous file access by sometimes more than a factor of 500 when compared to the traditional approach.}, comment = {published on the web} } @InProceedings{worringen:sci-io, author = {Joachim Worringen}, title = {Efficient parallel {I/O} on {SCI} connected clusters}, booktitle = {IEEE International Conference on Cluster Computing. CLUSTER 2000}, year = {2000}, month = {December}, pages = {371--372}, publisher = {IEEE Computer Society Press}, copyright = {(c)2005 IEE}, address = {Chemnitz, Germany}, URL = {http://www.lfbs.rwth-aachen.de/users/joachim/publications/shortpaper_ieee_cluster2000.pdf}, keywords = {parallel I/O, MPI-IO, SCI connected clusters, pario-bib}, abstract = {This paper presents a new approach towards parallel I/O for message-passing (MPI) applications on clusters built with commodity hardware and an SCI interconnect: instead of using the classic scheme of clients and a number of servers communicating via TCP/IP, a pure peer-to-peer communication topology based on efficient use of the underlying SCI interconnect is presented. Every process of the MPI application is client as well as server for I/O operations. This allows for a maximum of locality in file access, while the accesses to remote portions of the distributed file are performed via distributed shared memory techniques. A server is only required to manage the initial distribution of the file fragments between the participating nodes and to provide services like external access and redundancy.}, comment = {Short paper and a poster. Poster URL=.} } @InProceedings{wu:noncontiguous, author = {Jiesheng Wu and Pete Wyckoff and Dhabaleswar Panda}, title = {Supporting efficient noncontiguous access in {PVFS} over {I}nfiniBand}, booktitle = {Proceedings of the IEEE International Conference on Cluster Computing}, year = {2003}, month = {December}, pages = {344--351}, institution = {Ohio State Univ, Columbus, OH 43210 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, address = {Hong Kong, China}, URL = {http://nowlab.cis.ohio-state.edu/projects/mpi-iba/publication/wuj-cluster03.pdf}, keywords = {noncontiguous access patterns, PVFS, Infiniband, RDMA, pario-bib}, abstract = {Noncontiguous I/O access is the main access pattern in many scientific applications. Noncontiguity exists both in access to files and in access to target memory regions on the client. This characteristic imposes a requirement of native noncontiguous I/O access support in cluster file systems for high performance. In this paper we address noncontiguous data transmission between the client and the I/O server in cluster file systems over a high performance network. We propose a novel approach, RDMA Gather/Scatter, to transfer noncontiguous data for such I/O accesses. We also propose a new scheme, Optimistic Group Registration, to reduce memory registration costs associated with this approach. We have designed and incorporated this approach in a version of PVFS over InfiniBand. Through a range of PVFS and MPI-IO micro-benchmarks, and the NAS BTIO benchmark, we demonstrate that our approach attains significant performance gains compared to other existing approaches.} }
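A note on the three entries above (worringen:improving, worringen:non-contiguous, wu:noncontiguous): they all concern non-contiguous accesses expressed through MPI-IO derived datatypes. The fragment below is only my own minimal sketch of what such an access looks like at the application level; the file name, block sizes, and piece counts are invented for illustration, and it is not code from any of the cited papers.

    #include <mpi.h>

    #define BLOCK  1024          /* doubles per contiguous piece (invented) */
    #define PIECES 64            /* pieces read by each process (invented) */

    int main(int argc, char **argv)
    {
        MPI_File     fh;
        MPI_Datatype filetype;
        static double buf[BLOCK * PIECES];
        int          rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Strided view: each process owns one BLOCK-sized piece out of every
           nprocs consecutive pieces in the shared file. */
        MPI_Type_vector(PIECES, BLOCK, BLOCK * nprocs, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_WORLD, "data.in", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(double),
                          MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* One collective call expresses the whole non-contiguous access; an
           MPI-IO implementation may internally flatten the view into
           (offset, length) lists -- the overhead the listless-i/o work avoids. */
        MPI_File_read_all(fh, buf, BLOCK * PIECES, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }
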
@InProceedings{wu:thrashing, author = {Kun-Lung Wu and Philip S. Yu and James Z. Teng}, title = {Performance Comparison of Thrashing Control Policies for Concurrent Mergesorts with Parallel Prefetching}, booktitle = {Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, year = {1993}, pages = {171--182}, keywords = {disk prefetching, parallel I/O, disk caching, sorting, pario-bib}, comment = {They discuss prefetching and caching in database machines where mergesorts merge several input streams, each from its own disk, to one output stream, to its own disk. There are concurrent merges going on. A merge can cause thrashing when writes grab a clean buffer that holds an unused prefetch, thus forcing the block to later be read again. They consider several policies to handle this, but it seemed to me like they missed an obvious alternative, that may have been better: whenever you need a clean buffer to write into, but all the clean buffers hold unused-prefetched blocks, stall and wait while the dirty blocks are flushed (presumably started earlier when the clean-block count got too low). It seems better to wait for a half-finished write than to toss out a prefetched block and later have to read it again. Their simulations show that their techniques help a lot.} } @InProceedings{yamamoto:astronomical, author = {Naotaka Yamamoto and Osamu Tatebe and Satoshi Sekiguchi}, title = {Parallel and distributed astronomical data analysis on grid datafarm}, booktitle = {5th International Workshop on Grid Computing}, editor = {Buyya, R.}, year = {2004}, month = {November}, pages = {461--466}, institution = {AIST, Grid Technol Res Ctr, Tsukuba Cent 2, Umezono 1-1-1, Tsukuba, Ibaraki, Japan; AIST, Grid Technol Res Ctr, Tsukuba, Ibaraki, Japan}, publisher = {IEEE Computer Society Press}, copyright = {(c)2005 The Thomson Corporation}, address = {Pittsburgh, PA}, URL = {http://datafarm.apgrid.org/pdf/Grid2004-yamamoto.pdf}, keywords = {grid, grid datafarm, astronomical data, pario-app, pario-bib}, abstract = {A comprehensive study of the whole petabyte-scale archival data of astronomical observatories has a possibility of new science and new knowledge in the field, while it was not feasible so far due to lack of enough data analysis environment. The Grid Datafarm architecture is designed for global petabyte-scale data-intensive computing, which provides a Grid file system with file replica management for fault tolerance and load balancing, and parallel and distributed data computing support for a set of files, to meet the requirements of the comprehensive study of the whole archival data. In the paper, we discuss worldwide parallel and distributed data analysis in the observational astronomical field. The archival data is stored, replicated and dispersed in a Gfarm file system. All the astronomical data analysis tools successfully access files in the Gfarm file system without any code modification, using a syscall hooking library, regardless of file replica locations. Performance evaluation of the parallel data analysis in several ways shows file-affinity process scheduling plays an essential role for scalable and efficient parallel file I/O performance. A data calibration tool shows scalable file I/O performance, and achieved the file I/O performance of 5.9 GB/sec and 4.0 GB/sec for reading and writing FITS files, respectively, using 30 cluster nodes (60 CPUs). On-demand file replica creation mitigates the overhead of access concentration.
Another tool shows the performance improvement at a factor of six for reading a shared file by creating file replicas.} } @Article{yang:construction, author = {Chao-Tung Yang and Chien-Tung Pan and Kuan-Ching Li and Wen-Kui Chang}, title = {On construction of a large file system using {PVFS} for grid}, journal = {Lecture Notes in Computer Science}, booktitle = {5th International Conference on Parallel and Distributed Computing; December 8-10, 2004; Singapore, SINGAPORE}, editor = {Liew, KM; Shen, H; See, S; Cai, W; Fan, P; Horiguchi, S}, year = {2004}, month = {December}, volume = {3320}, pages = {860--863}, institution = {Tunghai Univ, High Performance Comp Lab, Dept Comp Sci \& Informat Engn, Taichung 407, Taiwan; Providence Univ, Parallel \& Distributed Proc Ctr, Dept Comp Sci \& Informat Management, Taichung 433, Taiwan}, publisher = {SPRINGER-VERLAG BERLIN}, copyright = {(c)2005 The Thomson Corporation}, URL = {http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=3320&spage=860}, keywords = {grid I/O, PVFS2, cluster file system, pario-bib}, abstract = {Grid is the largest advance of network after Internet since the Grid System provides a specialty that can be used popularly and effectively. However, it is a challenge to the consistency and community of use on the data storages space of a Grid System. Therefore, the problem of application for the Computational Grid and Data Grid is more important. It can set up a usability, expandability, high operation capability, and large memory space in Grid with the Cluster system and parallel technique in order to solve the problem. In this paper, we provided a Grid with high operation capability and higher memories to solve the problem. As to the Grid setting, we take use of the Cluster computing to increase the operation effect for computing, and a PVFS2 with more storages effect for data. It can supply a quite correct platform for Grid user whether for large data access or huge operation.} } @InProceedings{yokota:nets, author = {Haruo Yokota}, title = {{DR-Nets}: Data-Reconstruction Networks for Highly Reliable Parallel Disk Systems}, booktitle = {Proceedings of the IPPS~'94 Workshop on Input/Output in Parallel Computer Systems}, year = {1994}, month = {April}, pages = {105--116}, organization = {Japan Advanced Institute of Science and Technology (JAIST)}, note = {Also appeared in Computer Architecture News 22(4)}, later = {yokota:nets-book}, keywords = {parallel I/O, pario-bib}, comment = {They propose to link a set of disks with its own interconnect, e.g., a torus, to allow the disks to communicate to compute multi-dimensional parity and to respond to disk failures, without using the primary interconnect of the multiprocessor or distributed system. In this sense it is reminiscent of TickerTAIP or DataMesh.} } @InCollection{yokota:nets-book, author = {Haruo Yokota and Yasuyuki Mimatsu}, title = {A Scalable Disk System with Data Reconstruction Functions}, booktitle = {Input/Output in Parallel and Distributed Computer Systems}, chapter = {16}, editor = {Ravi Jain and John Werth and James C. Browne}, crossref = {iopads-book}, year = {1996}, series = {The Kluwer International Series in Engineering and Computer Science}, volume = {362}, pages = {353--372}, publisher = {Kluwer Academic Publishers}, earlier = {yokota:nets}, keywords = {parallel I/O architecture, disk array, pario-bib}, abstract = {Scalable disk systems are required to implement well-balanced computer systems. 
We have proposed DR-nets, Data-Reconstruction networks, to construct the scalable parallel disk systems with high reliability. Each node of a DR-net has disks, and is connected by links to form an interconnection network. To realize the high reliability, nodes in a sub-network of the interconnection network organize a group of parity calculation proposed for RAIDs. Inter-node communication for calculating parity keeps the locality of data transfer, and it inhibits bottlenecks from occurring, even if the size of the network becomes very large. We have developed an experimental system using Transputers. In this chapter, we provide execution models for estimating the response time and throughput of DR-nets, and compare them to experimental results. We also discuss the reliability of the DR-nets and RAIDs.}, comment = {Part of a whole book on parallel I/O; see iopads-book.} } @MastersThesis{youssef:thesis, author = {Rachad Youssef}, title = {{RAID} for Mobile Computers}, year = {1995}, month = {August}, school = {Carnegie Mellon University Information Networking Institute}, note = {Available as INI-TR 1995-3}, URL = {http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pdl/ftp/MOBILE/thesis.ps}, keywords = {parallel I/O, disk array, RAID, mobile computing, pario-bib}, comment = {low-power, highly available disk arrays for mobile computers.} } @Article{yu:modeling, author = {S. Yu and M. Winslett and J. Lee and X. Ma}, title = {Automatic and Portable Performance Modeling for Parallel {I/O}: A Machine-Learning Approach}, journal = {ACM SIGMETRICS Performance Evaluation Review}, year = {2002}, month = {December}, volume = {30}, number = {3}, pages = {3--5}, publisher = {ACM Press}, URL = {http://doi.acm.org/10.1145/605521.605524}, keywords = {parallel I/O, performance model, pario-bib}, abstract = {A performance model for a parallel I/O system is essential for detailed performance analyses, automatic performance optimization of I/O request handling, and potential performance bottleneck identification. Yet how to build a portable performance model for parallel I/O system is an open problem. In this paper, we present a machine-learning approach to automatic performance modeling for parallel I/O systems. Our approach is based on the use of a platform- independent performance metamodel, which is a radial basis function neural network. Given training data, the metamodel generates a performance model automatically and efficiently for a parallel I/O system on a given platform. Experiments suggest that our goal of having the generated model provide accurate performance predictions is attainable, for the parallel I/O library that served as our experimental testbed on an IBM SP. This suggests that it is possible to model parallel I/O system performance automatically and portably, and perhaps to model a broader class of storage systems as well.} } @InProceedings{yu:trading, author = {Xiang Yu and Benjamin Gum and Yuqun Chen and Randolph Y. Wang and Kai Li and Arvind Krishnamurthy and Thomas E. 
Anderson}, title = {Trading Capacity for Performance in a Disk Array}, booktitle = {Proceedings of the 2000 Symposium on Operating Systems Design and Implementation}, year = {2000}, month = {October}, pages = {243--258}, publisher = {USENIX Association}, address = {San Diego}, URL = {http://www.usenix.org/publications/library/proceedings/osdi2000/yugum.html}, keywords = {disk array, file system, parallel I/O, pario-bib}, abstract = {A variety of performance-enhancing techniques, such as striping, mirroring, and rotational data replication, exist in the disk array literature. Given a fixed budget of disks, one must intelligently choose what combination of these techniques to employ. In this paper, we present a way of designing disk arrays that can flexibly and systematically reduce seek and rotational delay in a balanced manner. We give analytical models that can guide an array designer towards optimal configurations by considering both disk and workload characteristics. We have implemented a prototype disk array that incorporates the configuration models. In the process, we have also developed a robust disk head position prediction mechanism without any hardware support. The resulting prototype demonstrates the effectiveness of the configuration models.} } @Article{zabback:reorg, author = {Peter Zabback and Ibrahim Onyuksel and Peter Scheuermann and Gerhard Weikum}, title = {Database Reorganization in Parallel Disk Arrays with {I/O} Service Stealing}, journal = {IEEE Transactions on Knowledge and Data Engineering}, year = {1998}, month = {September/October}, volume = {10}, number = {5}, pages = {855--858}, URL = {http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=15727}, keywords = {parallel I/O, disk array, database, disk reorganization, pario-bib} } @InProceedings{zajcew:osf1, author = {Roman Zajcew and Paul Roy and David Black and Chris Peak and Paulo Guedes and Bradford Kemp and John LoVerso and Michael Leibensperger and Michael Barnett and FaraMarz Rabii and Durriya Netterwala}, title = {An {OSF/1 UNIX} for Massively Parallel Multicomputers}, booktitle = {Proceedings of the 1993 Winter USENIX Technical Conference}, year = {1993}, month = {January}, pages = {449--468}, keywords = {unix, parallel operating system, multiprocessor file system, pario-bib}, comment = {Describing the changes to OSF/1 to make OSF/1 AD TNC, primarily intended for NORMA MIMD multicomputers. Enhancements include a new file system, distributed implementation of sockets, and process management. The file system still has traditional file systems, each in its own partition, with a global name space built by mounting file systems on each other. The change is that mounts can be remote, ie, managed by a different file server on another node. They plan to use prefix tables for pathname translation (welch:prefix,nelson:sprite). They use a token-based protocol to provide atomicity of read and write calls, and to maintain consistency of client-node caches. See also roy:unixfile. 
Process enhancements include a new SIGMIGRATE, rfork(), and rforkmulti().} } @InProceedings{zhang:n-spek, author = {Ming Zhang and Qing Yang}, title = {Performability Evaluation of Networked Storage Systems Using N-SPEK}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {736--741}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190736abs.htm}, keywords = {benchmarking, performance, block-level access, pario-bib}, abstract = {This paper introduces a new benchmark tool for evaluating performance and availability (performability) of networked storage systems, specifically storage area network (SAN) that is intended for providing block-level data storage with high performance and availability. The new benchmark tool, named N-SPEK (Networked-Storage Performability Evaluation Kernel module), consists of a controller, several workers, one or more probers, and several fault injection modules. N-SPEK is highly accurate and efficient since it runs at kernel level and eliminates skews and overheads caused by file systems. It allows a SAN architect to generate configurable storage workloads to the SAN under test and to inject different faults into various SAN components such as network devices, storage devices, and controllers. Available performances under different workloads and failure conditions are dynamically collected and recorded in the N-SPEK over a spectrum of time. To demonstrate its functionality, we apply N-SPEK to evaluate the performability of a specific iSCSI-based SAN under Linux environment. Our experiments show that N-SPEK not only efficiently generates quantitative performability results but also reveals a few optimization opportunities for future iSCSI implementations.} } @InProceedings{zhou:greedy, author = {Xinrong Zhou and Tong Wei}, title = {A Greedy {I/O} Scheduling Method in the Storage System of Clusters}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {712--717}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190712abs.htm}, keywords = {parallel I/O, disk scheduling, pario-bib}, abstract = {As the size of cluster becomes larger, the process ability of a cluster increases rapidly. Users will exploit this increased power to run scientific, physical and multimedia applications. These kinds of data-intensive applications require high performance storage subsystem. Parallel storage system such as RAID is widely used in today's clusters. In this paper, we bring out a "greedy" I/O scheduling method that utilizes Scatter and Gather operations inside the PCI-SCSI adapter to combine as many I/O operations within the same disk as possible. In this way we reduce the numbers of I/O operations and improve the performance of the whole storage system. After analyzing RAID control strategy, we find out that I/O commands' combination may also bring up data movement in memory and this kind of movement will increase the system's overhead. The experiment results in our real time operating environment show that a better performance can be achieved. 
The longer the data length is, the better improvement we can get, in some case, we can even get over 40% enhancement.} } @InProceedings{zhou:threads, author = {Yuanyuan Zhou and Limin Wang and Douglas W. Clark and Kai Li}, title = {Thread Scheduling for Out-of-Core Applications with Memory Server on Multicomputers}, booktitle = {Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems}, year = {1999}, month = {May}, pages = {57--67}, publisher = {ACM Press}, address = {Atlanta, GA}, URL = {http://vibes.cs.uiuc.edu/IOPADS/Accepted/Zhou.ps}, keywords = {threads, scheduling, memory, out-of-core application, parallel I/O, pario-bib}, abstract = {Out-of-core applications perform poorly in paged virtual memory (VM) systems because demand paging involves slow disk I/O accesses. Much research has been done on reducing the I/O overhead in such applications by either reducing the number of I/Os or lowering the cost of each I/O operation. In this paper, we investigate a method that combines fine-grained threading with a memory server model to improve the performance of out-of-core applications on multicomputers. The memory server model decreases the average cost of I/O operations by paging to remote memory, while the fine-grained thread scheduling reduces the number of I/O accesses by improving the data locality of applications. We have evaluated this method on an Intel Paragon with 7 applications. Our results show that the memory server system performs better than the VM disk paging by a factor of 5 for sequential applications and a factor of 1.5 to 2.2 for parallel applications. The fine-grained threading alone improves the VM disk paging performance by a factor of 10 and 1.2 to 3 respectively for sequential and parallel applications. Overall, the combination of these two techniques outperforms the VM disk paging by more than a factor of 12 for sequential applications and a factor of 3 to 6 for parallel applications.} } @InProceedings{zhu:case-study, author = {Y. F. Zhu and H. Jiang and X. Qin and D. Swanson}, title = {A case study of parallel I/O for biological sequence search on Linux clusters}, booktitle = {Proceedings of the IEEE International Conference on Cluster Computing}, year = {2003}, month = {December}, pages = {308--315}, institution = {Univ Nebraska, Dept Comp Sci \& Engn, Lincoln, NE 68588 USA}, publisher = {IEEE Computer Society Press}, copyright = {(c)2004 Institute for Scientific Information, Inc.}, address = {Hong Kong, China}, keywords = {BLAST, CEFT-PVFS, parallel I/O, PVFS, application, characterization, I/O access patterns, biology application, pario-app, pario-bib}, abstract = {In this paper we analyze the I/O access patterns of a widely-used biological sequence search tool and implement two variations that employ parallel-I/O for data access based on PVFS (Parallel Virtual File System) and CEFT-PVFS (Cost-Effective Fault-Tolerant PVFS). Experiments show that the two variations outperform the original tool when equal or even fewer storage devices are used in the former. It is also found that although the performance of the two variations improves consistently when initially increasing the number of servers, this performance gain from parallel I/O becomes insignificant with further increase in server number. We examine the effectiveness of two read performance optimization techniques in CEFT-PVFS by using this tool as a benchmark. 
Performance results indicate: (1) Doubling the degree of parallelism boosts the read performance to approach that of PVFS; (2) Skipping hotspots can substantially improve the I/O performance when the load on data servers is highly imbalanced. The I/O resource contention due to the sharing of server nodes by multiple applications in a cluster has been shown to degrade the performance of the original tool and the variation based on PVFS by up to 10 and 21 folds, respectively; whereas, the variation based on CEFT-PVFS only suffered a two-fold performance degradation.} } @InProceedings{zhu:ceft-pvfs, author = {Yifeng Zhu and Hong Jiang and Xiao Qin and Dan Feng and David R. Swanson}, title = {Improved Read Performance in a Cost-Effective, Fault-Tolerant Parallel Virtual File System {(CEFT-PVFS)}}, booktitle = {Workshop on Parallel I/O in Cluster Computing and Computational Grids}, year = {2003}, month = {May}, pages = {730--735}, publisher = {IEEE Computer Society Press}, address = {Tokyo}, note = {Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003}, URL = {http://csdl.computer.org/comp/proceedings/ccgrid/2003/1919/00/19190730abs.htm}, keywords = {parallel I/O, fault-tolerance, read performance, parallel file system, PVFS, pario-bib}, abstract = {Due to the ever-widening performance gap between processors and disks, I/O operations tend to become the major performance bottleneck of data-intensive applications on modern clusters. If all the existing disks on the nodes of a cluster are connected together to establish high performance parallel storage systems, the cluster's overall performance can be boosted at no additional cost. CEFT-PVFS (a RAID 10 style parallel file system that extends the original PVFS), as one such system, divides the cluster nodes into two groups, stripes the data across one group in a round-robin fashion, and then duplicates the same data to the other group to provide storage service of high performance and high reliability. Previous research has shown that the system reliability is improved by a factor of more than 40 with mirroring while maintaining a comparable write performance. This paper presents another benefit of CEFT-PVFS in which the aggregate peak read performance can be improved by as much as 100% over that of the original PVFS by exploiting the increased parallelism. \par Additionally, when the data servers, which typically are also computational nodes in a cluster environment, are loaded in an unbalanced way by applications running in the cluster, the read performance of PVFS will be degraded significantly. On the contrary, in the CEFT-PVFS, a heavily loaded data server can be skipped and all the desired data is read from its mirroring node. Thus the performance will not be affected unless both the server node and its mirroring node are heavily loaded.} }
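A closing aside on the last two entries: the CEFT-PVFS read optimization that zhu:case-study and zhu:ceft-pvfs describe boils down to choosing, for each mirrored stripe unit, between the primary data server and its mirror, and skipping whichever node is currently heavily loaded or has failed. The C fragment below is only my own sketch of that selection step; the server_t structure, its fields, and the function name are all invented for illustration, and this is not CEFT-PVFS source code.

    typedef struct {
        int    id;     /* data-server node id (hypothetical representation) */
        double load;   /* most recent load estimate reported by the node */
        int    alive;  /* nonzero if the node is reachable */
    } server_t;

    /* Choose which copy of a mirrored stripe unit to read.  If one node of
       the pair is down, fall back to the survivor; otherwise prefer the
       less-loaded node, which is the "skip the hotspot" behavior the
       abstract describes. */
    static const server_t *pick_replica(const server_t *primary,
                                        const server_t *mirror)
    {
        if (!primary->alive) return mirror;
        if (!mirror->alive)  return primary;
        return (mirror->load < primary->load) ? mirror : primary;
    }

In the real system the load reporting, failure detection, and striping layout are of course more involved; the point of the sketch is just that mirroring gives the client a per-request choice of server, which is where both the doubled peak read bandwidth and the hotspot skipping come from.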