Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Describes a parallel IBM/370, where they
attach several small 370s to a switch, and several disks to each 370. Not
much in the way of striping.
Abstract: As computation and communication
hardware performance continue to rapidly increase, I/O represents a growing
fraction of application execution time. This gap between the I/O subsystem
and other subsystems is expected to widen in the future, since I/O
performance is limited by physical motion. Therefore, it is imperative that
novel techniques for improving I/O performance be developed. Parallel I/O is
a promising approach to alleviating this bottleneck. However, very little
work exists with respect
to scheduling parallel I/O operations explicitly. In this paper, we address
the problem of effective management of parallel I/O in cluster computing
systems by using appropriate I/O scheduling strategies. We propose two new
I/O scheduling algorithms and compare them with two existing scheduling
approaches. The preliminary results show that the proposed policies
outperform existing policies substantially.
Keywords: parallel I/O, I/O scheduling algorithms,
pario-bib
Keywords: parallel I/O, out-of-core algorithm,
pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Keywords: out-of-core algorithm, graph, pario-bib
Keywords: parallel I/O, I/O, pario-bib
Comment: Derives formulas for the speedup with and
without I/O considered and with parallel software and hardware format
conversion. Considering I/O gives a more optimistic view of the speedup of a
program assuming that the parallel version can use its I/O bandwidth as
effectively as the serial processor. Concludes that, for a given number of
processors, increasing the I/O bandwidth is the most effective way to speed
up the program (over the format conversion improvements).
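The comment above does not reproduce the paper's actual formulas, so the
following is only an illustrative speedup model under simple assumptions: let
$T_c$ be the serial compute time, $T_{io}$ the serial I/O and
format-conversion time, $P$ the number of processors, and $\beta$ the factor
by which the parallel version improves effective I/O and conversion
bandwidth. Then $S(P) = (T_c + T_{io}) / (T_c/P + T_{io}/\beta)$. For a fixed
$P$, increasing $\beta$ attacks the term that does not shrink with $P$, which
is consistent with the comment's conclusion that increasing I/O bandwidth is
the most effective lever.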
Abstract: Getting good I/O performance from
parallel programs is a critical problem for many application domains. In this
paper, we report our experience tuning the I/O performance of four
application programs from the areas of satellite-data processing and linear
algebra. After tuning, three of the four applications achieve
application-level I/O rates of over 100 MB/s on 16 processors. The total
volume of I/O required by the programs ranged from about 75 MB to over 200
GB. We report the lessons learned in achieving high I/O performance from
these applications, including the need for code restructuring, local disks on
every node and knowledge of future I/O requests. We also report our
experience on achieving high performance on peer-to-peer configurations.
Finally, we comment on the necessity of complex I/O interfaces like
collective I/O and strided requests to achieve high performance.
Keywords: parallel I/O, filesystem workload,
parallel application, pario-bib
Abstract: We provide tight upper and lower bounds,
up to a constant factor, for the number of inputs and outputs (I/Os) between
internal memory and secondary storage required for five sorting-related
problems: sorting, the fast Fourier transform (FFT), permutation networks,
permuting, and matrix transposition. The bounds hold both in the worst case
and in the average case, and in several situations the constant factors
match. Secondary storage is modeled as a magnetic disk capable of transferring
$P$ blocks each containing $B$ records in a single time unit; the records in
each block must be input from or output to $B$ contiguous locations on the
disk. We give two optimal algorithms for the problems, which are variants of
merge sorting and distribution sorting. In particular we show for $P=1$ that
the standard merge sorting algorithm is an optimal external sorting method,
up to a constant factor in the number of I/Os. Our sorting algorithms use the
same number of I/Os as does the permutation phase of key sorting, except when
the internal memory size is extremely small, thus affirming the popular adage
that key sorting is not faster. We also give a simpler and more direct
derivation of Hong and Kung's lower bound for the FFT for the special case $B
= P = O(1)$.
Keywords: parallel I/O, sorting, pario-bib
Comment: Good comments on typical external sorts,
and how big they are. Focuses on parallelism at the disk. They give tight
theoretical bounds on the number of I/Os required to do external sorting and
other problems (FFTs, matrix transpose, etc.). If $x$ is the number of blocks
in the file and $y$ is the number of blocks that fit in memory, then the
number of I/Os needed grows as $\Theta(x \log x / \log y)$. If parallel
transfers of $p$ blocks are allowed, speedup linear in $p$ is obtained.
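For reference, in the notation of the abstract above ($N$ records, block size
$B$, memory size $M$, and $P$ blocks transferred per I/O), the well-known
sorting bound is $\Theta\left(\frac{N}{PB} \log_{M/B} \frac{N}{B}\right)$
I/Os, which is the comment's $\Theta(x \log x / \log y)$ with $x = N/B$
blocks in the file and $y = M/B$ blocks of memory, divided by $P$ for the
linear speedup from parallel block transfers.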
Keywords: compiler, I/O, pario-bib
Comment: Not really about parallel applications or
parallel I/O, but I think it may be of interest to that community. They
propose a compiler framework for a compiler to insert asynchronous I/O
operations (start I/O, finish I/O), to satisfy the dependency constraints of
the program.
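As a rough illustration of the kind of transformation such a framework
targets (a hand-written sketch using POSIX AIO, not code from the paper): the
blocking read is split into a ``start I/O'' issued as early as the
dependences allow and a ``finish I/O'' placed just before the first use of
the data, so the transfer overlaps the intervening computation.

/* Sketch only: compile with -lrt on Linux. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define N 4096

int main(void)
{
    static double buf[N];
    int fd = open("input.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* "start I/O": the compiler hoists this as early as dependences allow. */
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;
    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* ... computation that does not touch buf overlaps the transfer ... */

    /* "finish I/O": inserted just before the first use of buf. */
    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    if (aio_return(&cb) < 0) { perror("aio_return"); return 1; }

    printf("first value: %f\n", buf[0]);   /* first use of buf */
    return 0;
}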
Abstract: The motivation for the research
presented here is to develop an approach for scheduling I/O operations in
distributed/parallel computer systems. First, a general model for specifying
the parallel I/O scheduling problem is developed. The model defines the I/O
bandwidth for different parallel/distributed architectures. Then the model is
used to establish an algorithm for scheduling I/O operations on these
architectures.
Keywords: parallel I/O, scheduling, pario-bib
Abstract: This paper describes the performance
improvements achieved by a data-intensive application by controlling the
storage policies and algorithms of a distributed storage system. The Network
Storage Manager (NSM) is a mass distributed storage framework with a unique
architecture that provides applications with the high-performance features
they need. It also provides standard implementations of the most commonly
used storage policies. Distributed Terrain Viewer (DTViewer) is an
application that utilizes the NSM architecture for efficient and reliable
data delivery. Moreover, it exploits NSM's controllable architecture by
plugging in its own application-specific, optimized implementations.
DTViewer overrides the default NSM policies, which do not understand its
sophisticated access patterns, partitioning, and storage layout requirements.
Experimental results show that significant improvement is achieved when the
application-tailored implementations are used. Such speedups are not
achievable on storage systems with no application control, such as the
Parallel Virtual File System (PVFS). (44 Refs.)
Keywords: application-specific storage policies,
pario-app, DTViewer, access patterns, data layout, pario-bib
Abstract: An emerging class of data-intensive
applications involve the geographically dispersed extraction of complex
scientific information from very large collections of measured or computed
data. Such applications arise, for example, in experimental physics, where
the data in question is generated by accelerators, and in simulation science,
where the data is generated by supercomputers. So-called Data Grids provide
essential infrastructure for such applications, much as the Internet provides
essential services for applications such as e-mail and the Web. We describe
here two services that we believe are fundamental to any Data Grid: reliable,
high-speed transport and replica management. Our high-speed transport
service, GridFTP, extends the popular FTP protocol with new features required
for Data Grid applications, such as striping and partial file access. Our
replica management service integrates a replica catalog with GridFTP
transfers to provide for the creation, registration, location, and management
of dataset replicas. We present the design of both services and also
preliminary performance results. Our implementations exploit security and
other services provided by the Globus Toolkit.
Keywords: computational grid, data transfer,
network, I/O, pario-bib
Abstract: We are developing a system for
collaborative research and development for a distributed group of researchers
at different institutions around the world. In a new paradigm for
collaborative computational science, the computer code and supporting
infrastructure itself becomes the collaborating instrument, just as an
accelerator becomes the collaborating tool for large numbers of distributed
researchers in particle physics. The design of this "Collaboratory" allows
many users, with very different areas of expertise, to work coherently
together, on distributed computers around the world. Different supercomputers
may be used separately, or for problems exceeding the capacity of any single
system, multiple supercomputers may be networked together through high speed
gigabit networks. Central to this Collaboratory is a new type of community
simulation code, called "Cactus". The scientific driving force behind this
project is the simulation of Einstein's equations for studying black holes,
gravitational waves, and neutron stars, which has brought together
researchers in very different fields from many groups around the world to
make advances in the study of relativity and astrophysics. But the system is
also being developed to provide scientists and engineers, without expert
knowledge of parallel or distributed computing, mesh refinement, and so on,
with a simple framework for solving any system of partial differential
equations on many parallel computer systems, from traditional supercomputers
to networks of workstations.
Keywords: scientific application, grid,
input/output, parallel-io, pario-bib
Comment: invited talk. They describe a
computational toolkit (CACTUS) that allows developers to construct code
modules (thorns) to plug into the core system (cactus flesh). The toolkit
includes thorns for solving partial differential equations using MPI,
parallel elliptic solvers, thorns for I/O using FlexIO or HDF5, and thorns
for checkpointing. The talk showed results from a cactus code demo that ran
at SC'98. The demo combined two tightly-connected supercomputers (one in
Europe and one in America) using Globus to simulate the collision of two
neutron stars.
Keywords: fault tolerance, RAID, disk array,
parallel I/O, pario-bib
Abstract: Enterprise-scale storage systems, which
can contain hundreds of host computers and storage devices and up to tens of
thousands of disks and logical volumes, are difficult to design. The volume
of choices that need to be made is massive, and many choices have unforeseen
interactions. Storage system design is tedious and complicated to do by hand,
usually leading to solutions that are grossly over-provisioned, substantially
under-performing or, in the worst case, both. To solve the configuration
nightmare, we present Minerva: a suite of tools for designing storage systems
automatically. Minerva uses declarative specifications of application
requirements and device capabilities; constraint-based formulations of the
various sub-problems; and optimization techniques to explore the search space
of possible solutions. This paper also explores and evaluates the design
decisions that went into Minerva, using specialized micro- and
macro-benchmarks. We show that Minerva can successfully handle a workload
with substantial complexity (a decision-support database benchmark). Minerva
created a 16-disk design in only a few minutes that achieved the same
performance as a 30-disk system manually designed by human experts. Of equal
importance, Minerva was able to predict the resulting system's performance
before it was built.
Keywords: disk array, storage system, RAID,
automatic design, parallel I/O, pario-bib
Keywords: parallel architecture, MIMD, NUMA,
pario-bib
Comment: Interesting architecture. 3-d mesh of
pipelined packet-switch nodes, e.g., 16x16x16 is 4096 nodes, with 256 procs,
512 memory units, 256 I/O cache units, and 256 I/O processors attached. 2816
remaining nodes are just switching nodes. Each processor is 64-bit custom
chip with up to 128 simultaneous threads in execution. It alternates between
ready threads, with a deep pipeline. Inter-instruction dependencies
explicitly encoded by the compiler, stalling those threads until the
appropriate time. Each thread has a complete set of registers! Memory units
have 4-bit tags on each word, for full/empty and trap bits. Shared memory
across the network: ``The Tera ISP-level architecture is UMA, even though the
PMS-level architecture is NUMA. Put another way, the memory looks a single
cycle away to the compiler writer.'' - Burton Smith. See also tera:brochure.
Keywords: file caching, distributed file system,
pario-bib
Comment: Part of jin:io-book; reformatted version
of anderson:serverless.
Abstract: In benchmarking I/O systems, it is important to generate,
accurately, the I/O access pattern that one is intending to generate.
However, timing accuracy (issuing I/Os at the desired time) at high I/O
rates is difficult to achieve on stock operating systems. We currently lack
tools to easily and accurately generate complex I/O workloads on modern
storage systems. As a result, we may be introducing substantial errors in
observed system metrics when we benchmark I/O systems using inaccurate tools
for replaying traces or for producing synthetic workloads with known
inter-arrival times. In this paper, we demonstrate the need for timing
accuracy for I/O benchmarking in the context of replaying I/O traces. We
also quantitatively characterize the impact of error in issuing I/Os on
measured system parameters. For instance, we show that the error in
perceived I/O response times can be as much as +350% or -15% by using naive
benchmarking tools that have timing inaccuracies. To address this problem,
we present Buttress, a portable and flexible toolkit that can generate I/O
workloads with microsecond accuracy at the I/O throughputs of high-end
enterprise storage arrays. In particular, Buttress can issue I/O requests
within 100µs of the desired issue time even at rates of 10000 I/Os per
second (IOPS).
Keywords: benchmarking software, performance
analysis, I/O access patterns, I/O workloads, pario-bib
Comment: Looks like a really cool piece of
software. Generates I/O workloads by replaying I/O traces.
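For illustration only (this is not Buttress's implementation, and the trace
format and constants below are made up): a naive replayer tries to issue
each traced read at its recorded offset from the start of the run, sleeping
most of the way and spinning for the final stretch. It is exactly this kind
of ad hoc timing control whose inaccuracy the paper quantifies.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* One trace record: when to issue (ns after start), where, and how much. */
struct trace_rec { uint64_t issue_ns; off_t offset; size_t len; };

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void replay(int fd, const struct trace_rec *t, size_t n, char *buf)
{
    uint64_t start = now_ns();
    for (size_t i = 0; i < n; i++) {
        uint64_t due = start + t[i].issue_ns;
        while (now_ns() + 50000 < due) {      /* >50us early: sleep briefly */
            struct timespec ts = { 0, 10000 };
            nanosleep(&ts, NULL);
        }
        while (now_ns() < due)                /* busy-wait the last stretch */
            ;
        if (pread(fd, buf, t[i].len, t[i].offset) < 0)
            perror("pread");
    }
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    static char buf[65536];
    struct trace_rec trace[] = {              /* tiny made-up trace */
        {       0,     0, 4096 },
        { 2000000, 65536, 4096 },             /* 2 ms after start */
    };
    replay(fd, trace, 2, buf);
    close(fd);
    return 0;
}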
Abstract: Disk arrays have a myriad of
configuration parameters that interact in counter-intuitive ways, and those
interactions can have significant impacts on cost, performance, and
reliability. Even after values for these parameters have been chosen, there
are exponentially-many ways to map data onto the disk arrays' logical units.
Meanwhile, the importance of correct choices is increasing: storage systems
represent a growing fraction of total system cost, they need to respond more
rapidly to changing needs, and there is less and less tolerance for mistakes.
We believe that automatic design and configuration of storage systems is the
only viable solution to these issues. To that end, we present a comparative
study of a range of techniques for programmatically choosing the RAID levels
to use in a disk array. Our simplest approaches are modeled on existing,
manual rules of thumb: they "tag" data with a RAID level before determining
the configuration of the array to which it is assigned. Our best approach
simultaneously determines the RAID levels for the data, the array
configuration, and the layout of data on that array. It operates as an
optimization process with the twin goals of minimizing array cost while
ensuring that storage workload performance requirements will be met. This
approach produces robust solutions with an average cost/performance
14-17% better than the best results for the tagging schemes, and up to
150-200% better than their worst solutions. We believe that this is the
first presentation and systematic analysis of a variety of novel,
fully-automatic RAID-level selection techniques.
Keywords: file systems, pario-bib
Keywords: file caching, distributed file system,
pario-bib
Comment: See anderson:serverless-sosp.
Abstract: Many parallel scientific applications need high-performance I/O.
Unfortunately, end-to-end parallel-I/O performance has not been able to keep
up with substantial improvements in parallel-I/O hardware because of poor
parallel file-system software. Many radical changes, both at the interface
level and the implementation level, have recently been proposed. One such
proposed interface is collective I/O, which allows parallel jobs to request
transfer of large contiguous objects in a single request, thereby preserving
useful semantic information that would otherwise be lost if the transfer
were expressed as per-processor non-contiguous requests. Kotz has proposed
disk-directed I/O as an efficient implementation technique for
collective-I/O operations, where the compute processors make a single
collective data-transfer request, and the I/O processors thereafter take
full control of the actual data transfer, exploiting their detailed
knowledge of the disk layout to attain substantially improved performance.
Recent parallel file-system usage studies show that writes to write-only
files are a dominant part of the workload. Therefore, optimizing writes
could have a significant impact on overall performance. In this paper, we
propose ENWRICH, a compute-processor write-caching scheme for write-only
files in parallel file systems. ENWRICH combines low-overhead write caching
at the compute processors with high-performance disk-directed I/O at the I/O
processors to achieve both low latency and high bandwidth. This combination
facilitates the use of the powerful disk-directed I/O technique independent
of any particular choice of interface. By collecting writes over many files
and applications, ENWRICH lets the I/O processors optimize disk I/O over a
large pool of requests. We evaluate our design via simulated implementation
and show that ENWRICH achieves high performance for various configurations
and workloads.
Keywords: parallel file system, parallel I/O,
caching, pario-bib, dfk
Abstract: Many parallel scientific applications need high-performance I/O.
Unfortunately, end-to-end parallel-I/O performance has not been able to keep
up with substantial improvements in parallel-I/O hardware because of poor
parallel file-system software. Many radical changes, both at the interface
level and the implementation level, have recently been proposed. One such
proposed interface is collective I/O, which allows parallel jobs to request
transfer of large contiguous objects in a single request, thereby preserving
useful semantic information that would otherwise be lost if the transfer
were expressed as per-processor non-contiguous requests. Kotz has proposed
disk-directed I/O as an efficient implementation technique for
collective-I/O operations, where the compute processors make a single
collective data-transfer request, and the I/O processors thereafter take
full control of the actual data transfer, exploiting their detailed
knowledge of the disk layout to attain substantially improved performance.
Recent parallel file-system usage studies show that writes to write-only
files are a dominant part of the workload. Therefore, optimizing writes
could have a significant impact on overall performance. In this paper, we
propose ENWRICH, a compute-processor write-caching scheme for write-only
files in parallel file systems. ENWRICH combines low-overhead write caching
at the compute processors with high-performance disk-directed I/O at the I/O
processors to achieve both low latency and high bandwidth. This combination
facilitates the use of the powerful disk-directed I/O technique independent
of any particular choice of interface. By collecting writes over many files
and applications, ENWRICH lets the I/O processors optimize disk I/O over a
large pool of requests. We evaluate our design via simulated implementation
and show that ENWRICH achieves high performance for various configurations
and workloads.
Keywords: parallel file system, parallel I/O,
caching, pario-bib, dfk
Abstract: High-performance parallel file systems
are needed to satisfy tremendous I/O requirements of parallel scientific
applications. The design of such parallel file systems depends on a
comprehensive understanding of the expected workload, but so far there have
been very few usage studies of multiprocessor file systems. In the first part
of this dissertation, we attempt to fill this void by measuring a real
file-system workload on a production parallel machine, namely the CM-5 at the
National Center for Supercomputing Applications. We collect information about
nearly every individual I/O request from the mix of jobs running on the
machine. Analysis of the traces leads to various recommendations for design
of future parallel file systems. Our usage study showed that writes to
write-only files are a dominant part of the workload. Therefore, optimizing
writes could have a significant impact on overall performance. In the second
part of this dissertation, we propose ENWRICH, a compute-processor
write-caching scheme for write-only files in parallel file systems. Within
its framework, ENWRICH uses a recently proposed high performance
implementation of collective I/O operations called disk-directed I/O, but it
eliminates a number of limitations of disk-directed I/O. ENWRICH combines
low-overhead write caching at the compute processors with high performance
disk-directed I/O at the I/O processors to achieve both low latency and high
bandwidth. This combination facilitates the use of the powerful disk-directed
I/O technique independent of any particular choice of interface, and without
the requirement for mapping libraries at the I/O processors. By collecting
writes over many files and applications, ENWRICH lets the I/O processors
optimize disk I/O over a large pool of requests. We evaluate our design of
ENWRICH using simulated implementation and extensive experimentation. We show
that ENWRICH achieves high performance for various configurations and
workloads. We pinpoint the reasons for ENWRICH's failure to perform well for
certain workloads, and suggest possible enhancements. Finally, we discuss the
nuances of implementing ENWRICH on a real platform and speculate about
possible adaptations of ENWRICH for emerging multiprocessing platforms.
Keywords: parallel I/O, multiprocessor file
system, file access patterns, workload characterization, file caching,
disk-directed I/O, pario-bib
Comment: See also ap:enwrich, ap:workload, and
nieuwejaar:workload
Abstract: High-performance parallel file systems
are needed to satisfy tremendous I/O requirements of parallel scientific
applications. The design of such high-performance parallel file systems
depends on a comprehensive understanding of the expected workload, but so far
there have been very few usage studies of multiprocessor file systems. This
paper is part of the CHARISMA project, which intends to fill this void by
measuring real file-system workloads on various production parallel machines.
In particular, here we present results from the CM-5 at the National Center
for Supercomputing Applications. Our results are unique because we collect
information about nearly every individual I/O request from the mix of jobs
running on the machine. Analysis of the traces leads to various
recommendations for parallel file-system design.
Keywords: parallel I/O, file access pattern,
multiprocessor file system, file system workload, dfk, pario-bib
Comment: See also kotz:workload,
nieuwejaar:strided.
Abstract: Rapid increases in the computational speeds of multiprocessors
have not been matched by corresponding performance enhancements in the I/O
subsystem. To satisfy the large and growing I/O requirements of some
parallel scientific applications, we need parallel file systems that can
provide high-bandwidth and high-volume data transfer between the I/O
subsystem and thousands of processors. Design of such high-performance
parallel file systems depends on a thorough grasp of the expected workload.
So far there have been no comprehensive usage studies of multiprocessor file
systems. Our CHARISMA project intends to fill this void. The first results
from our study involve an iPSC/860 at NASA Ames. This paper presents results
from a different platform, the CM-5 at the National Center for
Supercomputing Applications. The CHARISMA studies are unique because we
collect information about every individual read and write request and about
the entire mix of applications running on the machines. The results of our
trace analysis lead to recommendations for parallel file system design.
First, the file system should support efficient concurrent access to many
files, and I/O requests from many jobs under varying load conditions.
Second, it must efficiently manage large files kept open for long periods.
Third, it should expect to see small requests, predominantly sequential
access patterns, application-wide synchronous access, no concurrent
file-sharing between jobs, appreciable byte and block sharing between
processes within jobs, and strong interprocess locality. Finally, the trace
data suggest that node-level write caches and collective I/O request
interfaces may be useful in certain environments.
Keywords: parallel I/O, file access pattern,
multiprocessor file system, file system workload, dfk, pario-bib
Comment: See also kotz:workload,
nieuwejaar:strided.
Keywords: parallel file system, parallel I/O,
Intel iPSC/2, pario-bib
Comment: Studies the performance of Intel CFS.
Uses an application that reads in a huge file of records, each a genome
sequence, and compares each sequence against a given sequence. Looks at cache
performance, message latency, cost of prefetches and directory reads, and
throughput. He tries one-disk, one-proc transfer rates. Because of contention
with the directory server on one of the two I/O nodes, it was faster to put
all of the file on the other I/O node. Striping is good for multiple readers.
Best access pattern was interleaved, not segmented or separate files, because
it avoided disk seeks. Perhaps the files are stored contiguously? Can get
good speedup by reading the sequences in big integral record sizes, from CFS,
using load balancing scheduled by the host. Contention for directory blocks
- through single-node directory server.
Abstract: The paper presents a survey of the basic
paradigms for designing efficient external-memory algorithms and especially
for designing external-memory algorithms for computational geometry problems
with applications in GIS. As the area of external-memory algorithms is
relatively young, the paper focuses on fundamental external-memory design
techniques more than on algorithms for specific GIS problems. The
presentation is survey-like with a more detailed discussion of the most
important techniques and algorithms.
Keywords: out-of-core algorithm, geographic
information system, GIS, pario-bib
Comment: not parallel? but mentions some parallel
disk stuff.
Abstract: We present a set of algorithms designed to solve large-scale
geometric problems involving collections of line segments in the plane.
Geographical information systems (GIS) handle large amounts of spatial data,
and at some level the data is often manipulated as collections of line
segments. NASA's EOS project is an example of a GIS that is expected to
store and manipulate petabytes (thousands of terabytes, or millions of
gigabytes) of data! In the design of algorithms for this type of large-scale
application, it is essential to consider the problem of minimizing I/O
communication, which is the bottleneck. In this paper we develop efficient
new external-memory algorithms for a number of important problems involving
line segments in the plane, including trapezoid decomposition, batched
planar point location, triangulation, red-blue line segment intersection
reporting, and general line segment intersection reporting. In GIS systems,
the first three problems are useful for rendering and modeling, and the
latter two are frequently used for overlaying maps and extracting
information from them. To solve these problems, we combine and modify in
novel ways several of the previously known techniques for designing
efficient algorithms for external memory. We also develop a powerful new
technique that can be regarded as a practical external memory version of
fractional cascading. Except for the batched planar point location problem,
no algorithms specifically designed for external memory were previously
known for these problems. Our algorithms for triangulation and line segment
intersection partially answer previously posed open problems, while the
batched planar point location algorithm improves on the previously known
solution, which applied only to monotone decompositions. Our algorithm for
the red-blue line segment intersection problem is provably optimal.
Keywords: verify, out-of-core algorithm,
computational geometry, pario-bib
Comment: Special issue on cartography and
geographic information systems.
Keywords: out-of-core algorithm, computational
geometry, pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Abstract: In the design of algorithms for
large-scale applications it is essential to consider the problem of
minimizing I/O communication. Geographical information systems (GIS) are good
examples of such large-scale applications as they frequently handle huge
amounts of spatial data. In this paper we develop efficient new
external-memory algorithms for a number of important problems involving line
segments in the plane, including trapezoid decomposition, batched planar
point location, triangulation, red-blue line segment intersection reporting,
and general line segment intersection reporting. In GIS systems, the first
three problems are useful for rendering and modeling, and the latter two are
frequently used for overlaying maps and extracting information from them.
Keywords: out-of-core algorithm, computational
geometry, pario-bib
Comment: Does deal with parallel disks, though not
in great detail.
Abstract: We investigate the I/O complexity of the
problem of sorting sequences (or strings of characters) in external memory,
which is a fundamental component of many large-scale text applications. In
the standard unit-cost RAM comparison model, the complexity of sorting K
strings of total length N is $\Theta(K \log_2 K + N)$. By analogy, in the
external memory (or I/O) model, where the internal memory has size M and the
block transfer size is B, it would be natural to guess that the I/O
complexity of sorting sequences is $\Theta((K/B)\log_{M/B}(K/B) + N/B)$, but
the known algorithms do not come even close to achieving this bound. Our
results show, somewhat counterintuitively, that the I/O complexity of string
sorting depends upon the length of the strings relative to the block size. We
first consider a simple comparison I/O model, where the strings are not
allowed to be broken into their individual characters, and we show that the
I/O complexity of string sorting in this model is
$\Theta((N_1/B)\log_{M/B}(N_1/B) + K_2\log_{M/B} K_2 + N/B)$, where $N_1$ is
the total length of all strings shorter than B and $K_2$ is the number of
strings longer than B.
We then consider two more general I/O comparison models in which string
breaking is allowed. We obtain improved algorithms and in several cases lower
bounds that match their I/O bounds. Finally, we develop more practical
algorithms outside the comparison model.
Keywords: out-of-core algorithm, sorting
algorithm, pario-bib
Comment: This paper is really the same paper as
arge:sorting-strings.
Abstract: In this paper we address for the first
time the I/O complexity of the problem of sorting strings in external memory,
which is a fundamental component of many large-scale text applications. In
the standard unit-cost RAM comparison model, the complexity of sorting K
strings of total length N is $\Theta(K \log K + N)$. By analogy, in the external
memory (or I/O) model, where the internal memory has size M and the block
transfer size is B, it would be natural to guess that the I/O complexity of
sorting strings is $\Theta((K/B)\log_{M/B}(K/B) + N/B)$, but the known
algorithms do not come even close to achieving this bound. Our results show,
somewhat counterintuitively, that the I/O complexity of string sorting
depends upon the length of the strings relative to the block size. We first
consider a simple comparison I/O model, where one is not allowed to break the
strings into their characters, and we show that the I/O complexity of string
sorting in this model is $\Theta((N_1/B)\log_{M/B}(N_1/B) + K_2 \log_{M/B}
K_2 + N/B)$, where $N_1$ is the total length of all strings shorter than B and
$K_2$ is the number of strings longer than B. We then consider two more
general I/O comparison models in which string breaking is allowed. We obtain
improved algorithms and in several cases lower bounds that match their I/O
bounds. Finally, we develop more practical algorithms without assuming the
comparison model.
Keywords: out-of-core algorithm, sorting, parallel
I/O, pario-bib
Comment: Not parallel? But mentions some parallel
disk stuff.
Abstract: The single-disk, D-head model of parallel I/O was introduced by
Agarwal and Vitter to analyze algorithms for problem instances that are too
large to fit in primary memory. Subsequently Vitter and Shriver proposed a
more realistic model in which the disk space is partitioned into D disks,
with a single head per disk. To date, each problem for which there is a
known optimal algorithm for both models has the same asymptotic bounds on
both models. Therefore, it has been unknown whether the models are
equivalent or whether the single-disk model is strictly more powerful. In
this paper we provide evidence that the single-disk model is strictly more
powerful. We prove a lower bound on any general simulation of the
single-disk model on the multi-disk model and establish randomized and
deterministic upper bounds. Let $N$ be the problem size and let $T$ be the
number of parallel I/Os required by a program on the single-disk model. Then
any simulation of this program on the multi-disk model will require
$\Omega\left(T \frac{\log(N/D)}{\log \log(N/D)}\right)$ parallel I/Os. This
lower bound holds even if replication is allowed in the multi-disk model. We
also show an $O\left(\frac{\log D}{\log \log D}\right)$ randomized upper
bound and an $O\left(\log D (\log \log D)^2\right)$ deterministic upper
bound. These results exploit an interesting analogy between the disk models
and the PRAM and DCM models of parallel computation.
Keywords: parallel I/O, theory, parallel I/O
algorithm, pario-bib
Keywords: distributed query processing, dataflow,
pario-bib
Comment: River is a dataflow programming
environment for database query processing applications. River is specifically
designed for clusters of computers with heterogeneous performance
characteristics. The goal of the River runtime system is to adapt to
"performance faults"-portions of the system that perform poorly by
dynamically adjusting the transfer of data through the dataflow graph. River
uses two constructs to build applications: a distributed queue that deals
with performance faults by consumers, and graduated declustering that deals
with performance faults of producers. A distributed queue pushes data through
the dataflow graph at a rate proportional to the rate of consumption and
adapts to changes in consumption rates. Graduated declustering deals with
producer performance faults by reading from replicated producers. Although
River is designed specifically for query processing, they briefly discuss how
one might adapt scientific applications to work in their framework.
Abstract: We introduce River, a data-flow
programming environment and I/O substrate for clusters of computers. River is
designed to provide maximum performance in the common case, even in the
face of non-uniformities in hardware, software, and workload. River is based
on two simple design features: a high-performance distributed queue, and a
storage redundancy mechanism called graduated declustering. We have
implemented a number of data-intensive applications on River, which validate
our design with near-ideal performance in a variety of non-uniform
performance scenarios.
Keywords: cluster computing, parallel I/O,
pario-bib
Keywords: parallel I/O, prefetching, parallel file
system, pario-bib
Comment: A related paper is arunachalam:prefetch2.
Abstract: The significant difference between the
speeds of the I/O system (e.g., disks) and compute processors in parallel
systems creates a bottleneck that lowers the performance of an application
that does a considerable amount of disk accesses. A major portion of the
compute processors' time is wasted on waiting for I/O to complete. This
problem can be addressed to a certain extent, if the necessary data can be
fetched from the disk before the I/O call to the disk is issued. Fetching
data ahead of time, known as prefetching in a multiprocessor environment,
depends a great deal on the application's access pattern. The subject of this
paper is implementation and performance evaluation of a prefetching prototype
in a production parallel file system on the Intel Paragon. Specifically, this
paper presents a) design and implementation of a prefetching strategy in the
parallel file system and b) performance measurements and evaluation of the
file system with and without prefetching. The prototype is designed at the
operating system level for the PFS. It is implemented in the PFS subsystem of
the Intel Paragon Operating System. It is observed that in many cases
prefetching provides considerable performance improvements. In some other
cases no improvements or some performance degradation is observed due to the
overheads incurred in prefetching.
Keywords: parallel I/O, prefetching,
multiprocessor file system, pario-bib
Comment: See arunachalam:prefetch.
Keywords: parallel I/O, disk array, RAID,
pario-bib
Comment: Part of jin:io-book; reformatted version
of asami:self.
Abstract: This paper shows the suitability of a
self-maintaining approach to Tertiary Disk, a large-scale disk array system
built from commodity components. Instead of incurring the cost of custom
hardware, we attempt to solve various problems by design and software. We
have built a cluster of storage nodes connected by switched Ethernet. Each
storage node is a PC hosting a few dozen SCSI disks, running the FreeBSD
operating system. The system is used as a web-based image server for the Zoom
Project in cooperation with the Fine Arts Museums of San Francisco
(http://www.thinker.org/). We are designing a self-maintenance extension to
the OS to run on this cluster to mitigate the system administrator's burden.
There are several components required for building a self-maintaining
system. One is decoupling the time of failure from the time of hardware
replacement. This implies the system must have some amount of redundancy and
no single point of failure. Our system is fully redundant, and everything is
constructed to avoid a single point of failure. Another is correctly
identifying failures and their dependencies. The paper also outlines several
approaches to lowering the human cost of system administration of such a
system and making the system as autonomous as possible.
Keywords: parallel I/O, disk array, RAID,
pario-bib
Keywords: parallel I/O, hypercube, Intel iPSC/2,
file access pattern, pario-bib
Keywords: parallel I/O, architecture, pario-bib
Comment: They describe an I/O subsystem based on
an ``active memory'' called SWIM (Structured Wafer-based Intelligent Memory).
SWIM chips are RAM chips with some built-in processing. The idea is that
these tiny processors can manipulate the data in the chip at full speed,
without dealing with memory bus or off-chip costs. Further, the chips can
work in parallel. They demonstrate how they've used this to build a national
phone database server, a high-performance IP router, and a call-screening
agent.
Abstract: We describe an I/O subsystem based on an
active memory named SWIM (Structured Wafer-based Intelligent Memory) designed
for efficient storage and manipulation of data structures. The key
architectural idea in SWIM is to associate some processing logic with each
memory chip that allows it to perform data manipulation operations locally
and to communicate with a disk or a communication line through a backend
port. The processing logic is specially designed to perform operations such
as pointer dereferencing, memory indirection, searching and bounds checking
efficiently. The I/O subsystem is built using an interconnected ensemble of
such memory logic pairs. A complex processing task can now be distributed
between a large number of small memory processors each doing a sub-task,
while still retaining a common locus of control in the host CPU for higher
level administrative and provisioning functions. We argue that active memory
based processing enables more powerful, scalable and robust designs for
storage and communications subsystems, that can support emerging network
services, multimedia workstations and wireless PCS systems. A complete
parallel hardware and software system constructed using an array of SWIM
elements has been operational for over a year. We present results from
application of SWIM to three network functions: a national phone database
server, a high performance IP router, and a call screening agent.
Keywords: parallel I/O architecture, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel I/O, pario-bib
Comment: They describe using the techniques of
delrosario and debenedictis (although without mentioning them) to provide for
channels (parallel pipes) between independent data-parallel tasks. The
technique really is the same as in debenedictis and delrosario, although they
extend it a bit to allow multiple "files" within a channel (why not use
multiple channels?). Also, they depend on the program to read and write
synchronization variables to control access to the flow of data through the
channel. While this may provide good performance in some cases, why not have
support for automatic flow control? The system can detect when a portion of
the channel is written, and release readers waiting on that portion of the
channel (if any). The paper is a bit confusing in its use of the word "file",
which seems to be used to mean different things at different points. Also,
they seem to use an arbitrary distribution for the "file", which may or may
not be the same as one of those used by the two endpoints.
Abstract: Access to shared data is critical to the
long term success of grids of distributed systems. As more parallel
applications are being used on these grids, the need for some kind of
parallel I/O facility across distributed systems increases. However, grid
middleware has thus far had only limited support for distributed parallel
I/O. In this paper, we present an implementation of the MPI-2 I/O interface
using the Globus GridFTP client API. MPI is widely used for parallel
computing, and its I/O interface maps onto a large variety of storage
systems. The limitations of using GridFTP as an MPI-I/O transport mechanism
are described, as well as support for parallel access to scientific data
formats such as HDF and NetCDF. We compare the performance of GridFTP to that
of NFS on the same network using several parallel I/O benchmarks. Our tests
indicate that GridFTP can be a workable transport for parallel I/O,
particularly for distributed read-only access to shared data sets. (26 refs.)
Keywords: grid I/O, MPI-I/O, grid middleware,
gridFTP, pario-bib
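For context, the application side of such an implementation is ordinary
MPI-IO; a minimal read example follows. The MPI calls are standard MPI-2,
but the gridftp:// path is only a guess at how such an implementation might
name remote files. Collective calls such as MPI_File_read_at_all also give
the implementation the semantic information that other entries here (on
collective and disk-directed I/O) argue is needed for high performance.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process reads its own contiguous slice of a shared (possibly
       remote) data set; the path prefix is hypothetical. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "gridftp://host/path/dataset.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    const MPI_Offset chunk = 1 << 20;         /* 1 MB per process */
    char *buf = malloc(chunk);
    MPI_File_read_at_all(fh, rank * chunk, buf, (int)chunk, MPI_BYTE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}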
Keywords: parallel I/O, pario-bib, prefetching,
caching, multiprocessor file system, file access pattern
Comment: Basically there are two parts to this
paper. First, they will instrument applications, Intel PFS, and IBM Vesta, to
trace I/O-related activity. Then they'll use Pablo to analyze and
characterize. They plan to trace some events in detail, and the rest with
histogram counters. Second, they plan to develop caching and prefetching
policies and to analyze those with simulation, analysis, and implementation.
They note that IBM and Intel are developing parallel I/O architecture
simulators. See also poole:sio-survey, choudhary:sio-language,
bershad:sio-os.
Abstract: RAID storage arrays often possess
gigabytes of RAM for caching disk blocks. Currently, most RAID systems use
LRU or LRU-like policies to manage these caches. Since these array caches do
not recognize the presence of file system buffer caches, they redundantly
retain many of the same blocks as those cached by the file system, thereby
wasting precious cache space. In this paper, we introduce X-RAY, an exclusive
RAID array caching mechanism. X-RAY achieves a high degree of (but not
perfect) exclusivity through gray-box methods: by observing which files have
been accessed through updates to file system meta-data, X-RAY constructs an
approximate image of the contents of the file system cache and uses that
information to determine the exclusive set of blocks that should be cached by
the array. We use microbenchmarks to demonstrate that X-RAY's prediction of
the file system buffer cache contents is highly accurate, and trace-based
simulation to show that X-RAY considerably outperforms LRU and performs as
well as other more invasive approaches. The main strength of the X-RAY
approach is that it is easy to deploy - all performance gains are achieved
without changes to the SCSI protocol or the file system above.
Keywords: RAID, x-ray, caching policies, pario-bib
Keywords: parallel I/O, distributed file system,
mass storage, pario-bib
Comment: Architecture for distributed information
storage. Integrates file systems, databases, etc. Single system image, lots
of support for administration. O-O model, with storage device objects,
logical device objects, volume objects, and file objects. Methods for each
type of object, including administrative methods.
Abstract: Modern applications such as `video on
demand' require fast reading of complete files, which can be supported well
by file striping. Many conventional applications, however, are only
interested in some part of the available records. In order to avoid reading
attributes irrelevant to such applications, each attribute could be stored in
a separate (transposed) file. Aiming at I/O parallelism, byte-oriented
striping could be applied to transposed files. However, such a fragmentation
ignores the semantics of data. This fragmentation cannot be optimized by a
database management system (DBMS) because a DBMS has to perform its tasks on
the basis of data semantics. For example, queries must be translated into
file operations using a scheme that maps a data model to a file system.
However, details about files, such as the striping width, are invisible to a
DBMS. Therefore, we propose to store each transposed file related to a
composite type on a separate, independent disk drive, which means I/O
parallelism tuned to a data model. As we also aim at system reliability and
data availability, each transposed file must be duplicated on another drive.
Consequently, a DBMS also has to guarantee correctness and completeness of
the allocation of transposed files within an array of disk drives. As a
solution independent of the underlying data model, we propose an abstract
framework consisting of a meta model and a set of rules.
Keywords: database, parallel I/O, pario-bib
Keywords: multiprocessor file system, file access
pattern, parallel I/O, hypercube, pario-bib
Comment: Census-data processing on an nCUBE/10 at
USC. Their program uses an interleaved pattern, which is like my lfp or gw
with multi-record records (i.e., the application does its own blocking).
Shifted to asynchronous I/O to do OBL manually. Better results if they did
more computation per I/O (of course).
Abstract: We show two algorithms for computing multidimensional Fast Fourier
Transforms (FFTs) on a multiprocessor system with distributed memory when
problem sizes are so large that the data do not fit in the memory of the
entire system. Instead, data reside on a parallel disk system and are
brought into memory in sections. We use the Parallel Disk Model for
implementation and analysis. The first method is a straightforward
out-of-core variant of a well-known method for in-core, multidimensional
FFTs. It performs 1-dimensional FFT computations on each dimension in turn.
This method is easy to generalize to any number of dimensions, and it also
readily permits the individual dimensions to be of any sizes that are
integer powers of 2. The key step is an out-of-core transpose operation that
places the data along each dimension into contiguous positions on the
parallel disk system so that the data for the 1-dimensional FFTs are
contiguous. The second method is an adaptation of another well-known method
for in-core, multidimensional FFTs. This method computes all dimensions
simultaneously. It is more difficult to generalize to arbitrary radices and
number of dimensions in this method than in the first method. Our present
implementation is therefore limited to two dimensions of equal size, that
are again integer powers of 2. We present I/O complexity analyses for both
methods as well as empirical results for a DEC 2100 server and an SGI Origin
2000, each of which has a parallel disk system. Our results indicate that
the methods are comparable in speed in two dimensions.
Keywords: parallel I/O, out of core, FFT, parallel
algorithm, scientific computing, pario-bib
Comment: Undergraduate Honors Thesis. Advisor: Tom
Cormen.
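For reference, the first method's dimension-by-dimension decomposition is the
standard row-column identity: for an $n_1 \times n_2$ array $x[j_1,j_2]$,
$X[k_1,k_2] = \sum_{j_1=0}^{n_1-1} \omega_{n_1}^{j_1 k_1} \left(
\sum_{j_2=0}^{n_2-1} \omega_{n_2}^{j_2 k_2} x[j_1,j_2] \right)$ with
$\omega_n = e^{-2\pi i/n}$, so each pass is a batch of independent
1-dimensional FFTs, and the out-of-core transpose between passes is what
keeps each batch's data contiguous on the parallel disk system.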
Keywords: parallel I/O, hashing, reliability, disk
mirroring, pario-bib
Comment: Describes a file system for a distributed
system that scatters records of each file over many disks using hash
functions. The hash function is known by all processors, so no one processor
must be up to access the file. Any portion of the file whose disknode is
available may be accessed. Shadow nodes are used to take over for nodes that
go down, saving the info for later use by the proper node. Intended to easily
parallelize read/write accesses and global file operations, and to increase
file availability.
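A minimal sketch of the placement idea described above (the hash and the
shadow-node policy here are illustrative, not the paper's actual functions):
every processor computes a record's home disk node from its key alone, so no
central server is needed, and a shadow node can be derived the same way when
the home node is down.

#include <stdint.h>
#include <stdio.h>

#define NUM_DISK_NODES 16

/* 64-bit FNV-1a hash of the record key; any hash known to all
   processors works equally well. */
static uint64_t hash_key(const char *key)
{
    uint64_t h = 0xcbf29ce484222325ull;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 0x100000001b3ull;
    }
    return h;
}

/* Home disk node for a record. */
static int home_node(const char *key)
{
    return (int)(hash_key(key) % NUM_DISK_NODES);
}

/* Shadow node used when the home node is down (illustrative policy:
   simply the next node in sequence). */
static int shadow_node(const char *key)
{
    return (home_node(key) + 1) % NUM_DISK_NODES;
}

int main(void)
{
    const char *keys[] = { "record-0001", "record-0002", "record-0003" };
    for (int i = 0; i < 3; i++)
        printf("%s -> node %d (shadow %d)\n",
               keys[i], home_node(keys[i]), shadow_node(keys[i]));
    return 0;
}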
Keywords: disk model, I/O bus, device model, I/O
model, pario-bib
Keywords: disk model, I/O bus, device model, I/O
model, pario-bib
Abstract: We provide a competitive analysis framework for online prefetching
and buffer management algorithms in parallel I/O systems, using a read-once
model of block references. This has widespread applicability to key
I/O-bound applications such as external merging and concurrent playback of
multiple video streams. Two realistic lookahead models, global lookahead and
local lookahead, are defined. Algorithms NOM and GREED based on these two
forms of lookahead are analyzed for shared buffer and distributed buffer
configurations, both of which occur frequently in existing systems. An
important aspect of our work is that we show how to implement both the
models of lookahead in practice using the simple techniques of forecasting
and flushing. Given a D-disk parallel I/O system and a globally shared I/O
buffer that can hold up to M disk blocks, we derive a lower bound of
$\Omega(\sqrt{D})$ on the competitive ratio of any deterministic online
prefetching algorithm with O(M) lookahead. NOM is shown to match the lower
bound using global M-block lookahead. In contrast, using only local
lookahead results in an $\Omega(D)$ competitive ratio. When the buffer is
distributed into D portions of M/D blocks each, the algorithm GREED based on
local lookahead is shown to be optimal, and NOM is within a constant factor
of optimal. Thus we provide a theoretical basis for the intuition that
global lookahead is more valuable for prefetching in the case of a shared
buffer configuration, whereas it is enough to provide local lookahead in the
case of the distributed configuration. Finally, we analyze the performance
of these algorithms for reference strings generated by a uniformly-random
stochastic process and we show that they achieve the minimal expected number
of I/Os. These results also give bounds on the worst-case expected
performance of algorithms which employ randomization in the data layout.
Keywords: disk prefetching, file caching, parallel
I/O, pario-bib
Comment: See also barve:competitive. They propose
two methods for scheduling prefetch operations in the situation where the
access pattern is largely known in advance, in such a way as to minimize the
total number of parallel I/Os. The two methods are quite straightforward, and
yet match the optimum lower bound for an on-line algorithm.
Abstract: We consider the problem of sorting a file of N records on the
D-disk model of parallel I/O in which there are two sources of parallelism.
Records are transferred to and from disk concurrently in blocks of B
contiguous records. In each I/O operation, up to one block can be
transferred to or from each of the D disks in parallel. We propose a simple,
efficient, randomized mergesort algorithm called SRM that uses a
forecast-and-flush approach to overcome the inherent difficulties of simple
merging on parallel disks. SRM exhibits a limited use of randomization and
also has a useful deterministic version. Generalizing the technique of
forecasting, our algorithm is able to read in, at any time, the ``right''
block from any disk, and using the technique of flushing, our algorithm
evicts, without any I/O overhead, just the ``right'' blocks from memory to
make space for new ones to be read in. The disk layout of SRM is such that
it enjoys perfect write parallelism, avoiding fundamental inefficiencies of
previous mergesort algorithms. By analysis of generalized maximum occupancy
problems we are able to derive an analytical upper bound on SRM's expected
overhead valid for arbitrary inputs. The upper bound derived on expected I/O
performance of SRM indicates that SRM is provably better than disk-striped
mergesort (DSM) for realistic parameter values D, M, and B. Average-case
simulations show further improvement on the analytical upper bound. Unlike
previously proposed optimal sorting algorithms, SRM outperforms DSM even
when the number D of parallel disks is small.
Keywords: parallel I/O algorithm, sorting,
pario-bib
Comment: This paper formerly called
barve:mergesort; I discovered that the paper had appeared in SPAA96, so the
SPAA96 paper is now called barve:mergesort.
Keywords: parallel I/O algorithm, sorting,
pario-bib
Abstract: In modern I/O architectures, multiple
disk drives are attached to each I/O bus. Under I/O-intensive workloads, the
disk latency for a request can be overlapped with the disk latency and data
transfers of requests to other disks, potentially resulting in an aggregate
I/O throughput at nearly bus bandwidth. This paper reports on a performance
impairment that results from a previously unknown form of convoy behavior in
disk I/O, which we call rounds. In rounds, independent requests to distinct
disks convoy, so that each disk services one request before any disk services
its next request. We analyze log files to describe read performance of
multiple Seagate Wren-7 disks that share a SCSI bus under a heavy workload,
demonstrating the rounds behavior and quantifying its performance impact.
Keywords: disk, I/O bus, parallel I/O, pario-bib
Keywords: parallel architecture, array processor,
parallel I/O, SIMD, pario-bib
Comment: This paper is reproduced in Kuhn and
Padua's (1981, IEEE) survey ``Tutorial on Parallel Processing.'' The STARAN
is an array processor that uses Multi-Dimensional-Access (MDA) memories and
permutation networks to access data in bit slices in a variety of ways, with
high-speed I/O capabilities. Its router (called the flip network) could
permute data among the array processors, or between the array processors and
external devices, including disks, video input, and displays.
Keywords: parallel I/O, parallel architecture,
simulation, pario-bib
Keywords: performance evaluation, parallel
architecture, parallel I/O, pario-bib
Comment: They use a simulator to evaluate the
performance of a parallel I/O system. They simulate the network and disks
under a synthetic workload, and measure the time it takes for I/O requests to
traverse the network, be processed, and return. They also measure the impact
of I/O requests on non-I/O messages. Their results are fairly unsurprising.
Keywords: performance evaluation, parallel
architecture, parallel I/O, pario-bib
Keywords: parallel I/O, parallel architecture,
performance analysis, pario-bib
Comment: See polished version
baylor:vulcan-perf-book. Simulation of the I/O architecture for the Vulcan
MPP at IBM TJW. This is a distributed-memory MIMD system with a bidirectional
omega-type interconnection network, and separate compute and I/O nodes. They
use a stochastic workload to evaluate the average I/O performance under a few
different situations, and then use that average performance, along with a
stochastic workload, in a detailed simulation of the interconnection network.
(What would be the effect of adding variance to the I/O-node performance?) A
key point is that the I/O node will not accept any more requests until a
current write request is finished being processed (copied into the write-back
cache). If there are many writes, this can back up the network (would a
different write-request protocol help?). It is not clear how concurrency of
reads is modeled. Results show that the network saturates for high request
rates and a small number of I/O nodes. As the request rate decreases or the
number of I/O nodes
increases, performance levels off to a reasonable value. Placement of I/O
nodes didn't make much difference, nor did extra non-I/O traffic. Given their
parameters, and for reasonable loads, 1 I/O node per 4 compute nodes was a
reasonable balance, and was scalable.
Abstract: Presented are the trace-driven
simulation results of a study conducted to evaluate the performance of the
internal parallel I/O subsystem of the Vulcan massively parallel processor
(MPP) architecture. The system sizes evaluated vary from 16 to 512 nodes. The
results show that a compute node to I/O node ratio of four is the most cost
effective for all system sizes, suggesting high scalability. Also,
processor-to-processor communication effects are negligible for small message
sizes and the greater the fraction of I/O reads, the better the I/O
performance. Worst-case I/O node placement is within 13% of more efficient
placement strategies. Introducing parallelism into the internal I/O subsystem
improves I/O performance significantly.
Keywords: parallel I/O architecture, performance
evaluation, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: In recent years, the design and
performance evaluation of parallel processors has focused on the processor,
memory and communication subsystems. As a result, these subsystems have
better performance potential than the I/O subsystem. In fact, the I/O
subsystem is the bottleneck in many machines. However, there are a number of
studies currently underway to improve the design of parallel I/O subsystems.
To develop optimal parallel I/O subsystem designs, one must have a thorough
understanding of the workload characteristics of parallel I/O and its
exploitation of the associated parallel file system. Presented are the
results of a study conducted to analyze the parallel I/O workloads of several
applications on a parallel processor using the Vesta parallel file system.
Traces of the applications are obtained to collect system events,
communication events, and parallel I/O events. The traces are then analyzed
to determine workload characteristics. The results show I/O request rates on
the order of hundreds of requests per second, a large majority of requests
are for small amounts of data (less than 1500 bytes), a few requests are for
large amounts of data (on the order of megabytes), significant file sharing
among processes within a job, and strong temporal, traditional spatial, and
interprocess spatial locality.
Keywords: parallel I/O, workload characterization,
pario-bib
Comment: See polished version
baylor:workload-book. They characterize four parallel applications: sort,
matrix multiply, seismic migration, and video server, in terms of their I/O
activity. They found results that are consistent with kotz:workload, in that
they also found lots of small data requests, some large data requests,
significant file sharing and interprocess locality. This study found less of
the non-contiguous access than did kotz:workload, because of the logical
views provided by Vesta. Note on-line postscript does not include figures.
Abstract: To develop optimal parallel I/O
subsystems, one must have a thorough understanding of the workload
characteristics of parallel I/O and its exploitation of the associated
parallel file system. Presented are the results of a study conducted to
analyze the parallel I/O workloads of several applications on a parallel
processor using the Vesta parallel file system. Traces of the applications
are obtained to collect system events, communication events, and parallel I/O
events. The traces are then analyzed to determine workload characteristics.
The results show I/O request rates on the order of hundreds of requests per
second, a large majority of requests are for small amounts of data (less than
1500 bytes), a few requests are for large amounts of data (on the order of
megabytes), significant file sharing among processes within a job, and strong
temporal, traditional spatial, and interprocess spatial locality.
Keywords: parallel I/O, file access pattern,
workload characterization, file system workload, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: BBN, parallel I/O, pario-bib
Comment: Administrative manual for the TC2000 I/O
system. Can stripe over partitions in a user-specified set of disks. Large
requests automatically split and done in parallel. See also garber:tc2000.
Keywords: parallel I/O, scientific computing,
linear algebra, pario-bib
Comment: They look at out-of-core block and slab
solvers for the Maspar. They overlap reading one block with the computation
of the previous block. They solve matrices up to 40k x 40k, and obtain 3.14
GFlops even with I/O considered.
Keywords: file access pattern, disk prefetch, file
system, pario-bib
Comment: A specialized database system for
particle physics codes. Valuable for its description of access patterns and
subsequent file access requirements. Particle-in-cell codes iterate over
timesteps, updating the position of each particle, and then the
characteristics of each cell in the grid. Particles may move from cell to
cell. Updating a particle requires the particle itself and nearby grid-cell
data. The whole
dataset is too big for memory, and each timestep must be stored on disk for
later analysis anyway. Regular file systems are inadequate: specialized DBMS
is more appropriate. Characteristics needed by their application class:
multidimensional access (by particle type or by location, i.e., multiple
views of the data), coordination between grid and particle data, coordination
between processors, coordinated access to meta-data, inverted files,
horizontal clustering, large blocking of data, asynchronous I/O, array data,
complicated joins, and prefetching according to user-prespecified order. Note
that many of these things can be provided by a file system, but that most are
hard to come by in typical file systems, if not impossible. Many of these
features are generalizable to other applications.
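The timestep structure described above is simple to sketch; the following
minimal Python loop (invented sizes, field update, and dump-file name) shows
the characteristic pattern of updating particles from nearby grid-cell data
and writing every timestep to disk for later analysis:
    import numpy as np

    CELLS, PARTICLES, STEPS = 64, 10_000, 5      # invented sizes
    rng = np.random.default_rng(0)
    pos = rng.random(PARTICLES)                  # particle positions in [0, 1)
    grid = np.zeros(CELLS)                       # per-cell field value

    with open("timesteps.bin", "wb") as dump:    # invented dump-file name
        for step in range(STEPS):
            cell = np.minimum((pos * CELLS).astype(int), CELLS - 1)
            # a particle update needs the particle plus its cell's data
            pos = (pos + 0.001 * (1.0 + grid[cell])) % 1.0
            # then the grid is rebuilt from the particles now in each cell
            grid = np.bincount(cell, minlength=CELLS).astype(float)
            # the whole state is too big for memory in real codes and is
            # needed later anyway, so each timestep goes to disk
            dump.write(pos.tobytes())
            dump.write(grid.tobytes())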
Keywords: hypercube, graphics, parallel algorithm,
parallel I/O, pario-bib
Comment: About using the nCUBE/10's RT Graphics
System. They were frustrated by an unusual mapping from the graphics memory
to the display, a shortage of memory on the graphics nodes, and small message
buffers on the graphics nodes. They wrote some algorithms for collecting the
columns of pixels from the hypercube nodes, and routing them to the
appropriate graphics node. They also would have liked a better
interconnection network between the graphics nodes, at least for
synchronization.
Keywords: parallel I/O, pario-bib
Comment: Jovian is a runtime library for use with
SPMD codes, eg, HPF. They restrict I/O to collective operations, and provide
extra processes to 'coalesce' the many requests from multiple CPs into fewer
larger requests to the operating system, perhaps optimized for access order.
They mention that there is a standardization process underway for specifying
data distributions. Also a compact representation for strided access to
n-dimensional data structures. Coalescing basically means combining requests
to eliminate duplication and to combine adjacent requests. Requests to
coalescers are in full blocks, to lower the processing overhead. Nonetheless,
their method involves moving requests around twice, and involves several
memory-memory copies of the data, so their overhead is high.
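The coalescing step itself is easy to picture; a minimal Python sketch
(offsets and lengths are invented) that merges duplicate, overlapping, and
adjacent (offset, length) requests into fewer larger ones:
    def coalesce(requests):
        # requests: iterable of (offset, length) pairs, possibly duplicated.
        merged = []
        for off, length in sorted(set(requests)):
            if merged and off <= merged[-1][0] + merged[-1][1]:
                # overlaps or abuts the previous range: extend it
                prev_off, prev_len = merged[-1]
                merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
            else:
                merged.append((off, length))
        return merged
    # e.g. coalesce([(0, 4), (4, 4), (2, 4), (16, 8)]) == [(0, 8), (16, 8)]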
Keywords: parallel I/O, network, supercomputer
system, pario-bib
Comment: An update of berdahl:woodenman, close to
the final draft.
Keywords: parallel I/O, network, supercomputer
system, pario-bib
Comment: They describe a protocol for making
parallel data transfers of arbitrary data sets from one set of data servers
to another set of data servers. The goal is to be independent of specific
architectures or even types of data servers, and to work on top of existing
transport protocols. The data set is described using a gather set for the
source and a scatter set for the destination, and using a linear address
space as an intermediate representation. All the servers are contacted, they
figure out who they need to talk to, and exchange port information with them.
Each pair exchanges votes on who will control the transfer (ie, who will
control the order of the transfer), and on their maximum data rates. This
information is used to settle on the control and set of ports to be used.
This proposal is not final and is under active development, so it may change.
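The role of the linear address space can be sketched concretely: lay the
gather set and the scatter set out in the same linear space and intersect
their extents to find which source/destination pairs must talk, and for which
ranges. A minimal Python sketch with invented server names and piece sizes:
    def extents(pieces):
        # pieces: list of (server, length) in transfer order.  Returns
        # (server, start, length) extents in the linear address space.
        out, pos = [], 0
        for server, length in pieces:
            out.append((server, pos, length))
            pos += length
        return out

    def pair_transfers(gather, scatter):
        # Intersect source and destination extents in the linear address
        # space to find the (source, destination, start, length) transfers.
        plan = []
        for src, s0, slen in extents(gather):
            for dst, d0, dlen in extents(scatter):
                lo, hi = max(s0, d0), min(s0 + slen, d0 + dlen)
                if lo < hi:
                    plan.append((src, dst, lo, hi - lo))
        return plan
    # e.g. pair_transfers([("A", 100), ("B", 50)], [("X", 75), ("Y", 75)])
    #   -> [('A', 'X', 0, 75), ('A', 'Y', 75, 25), ('B', 'Y', 100, 50)]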
Keywords: parallel computing, performance
evaluation, parallel file system, pario-bib
Comment: In German. They summarize typical
performance of the Intel Paragon, including the communication performance and
the parallel file-system performance.
Keywords: scientific computation, application,
parallel I/O, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Four major components: networking, memory
servers, file system, and persistent object store. Networking part focuses on
low-latency communication support within an application, between
applications, and between machines (Bershad and Peterson). Memory servers,
shared virtual memory, and checkpointing support (Kai Li). File systems
support includes benchmarking, transparent informed prefetching (Gibson), a
common interface for PFS and Vesta (Snir), and integrating secondary and
tertiary storage systems (including the integration of the National Storage
Lab's HPSS (see coyne:hpss) into this project in 1995). OSF/1 (Black) will be
extended to support parallel file systems, extent-like behavior, and block
coalescing. Persistent object store (DeWitt) is a radical change to an
object-oriented interface, transparent I/O (though extensible and changeable
with subclassing, presumably), and heterogeneous support via the Object
Definition Language standard. Persistent objects may be integrated with the
memory servers and shared virtual memory. See also poole:sio-survey,
bagrodia:sio-character, choudhary:sio-language.
Keywords: fault tolerance, multimedia, video on
demand, parallel I/O, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Much like Intel CFS, with different I/O
modes that determine when the compute nodes synchronize, and the semantics of
I/Os written to the file. They found it hard to get good bandwidth for
independent I/Os, as opposed to coordinated I/Os; part of this was due to
their RAID 3 disk array, but it is more complicated than that. Some
performance numbers were given in talk.
Keywords: RAID, disk array, reliability, parallel
I/O, pario-bib
Comment: Uses the Information Dispersal Algorithm
(IDA) to generate $n+m$ blocks from $n$ blocks, to tolerate $m$ disk
failures; all of the data from the $n$ blocks is hidden in the $n+m$ blocks.
Not with the RAID project.
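The $n+m$-from-$n$ encoding can be sketched with a Vandermonde matrix, where
any $n$ of the $n+m$ coded blocks determine the originals. A minimal Python
sketch over exact rationals (a real IDA works over a finite field; block
contents and sizes are invented):
    from fractions import Fraction

    def encode(data, m):
        # data: n source blocks, each a list of numbers.  Coded block i is
        # the dot product of row i of a Vandermonde matrix with the source
        # blocks, so any n coded blocks determine the originals.
        n = len(data)
        rows = [[Fraction(i + 1) ** j for j in range(n)] for i in range(n + m)]
        return [(i, [sum(r[j] * Fraction(data[j][k]) for j in range(n))
                     for k in range(len(data[0]))])
                for i, r in enumerate(rows)]

    def decode(pieces, n):
        # pieces: any n surviving (index, coded block) pairs.
        idx = [i for i, _ in pieces]
        A = [[Fraction(i + 1) ** j for j in range(n)] for i in idx]
        B = [list(map(Fraction, blk)) for _, blk in pieces]
        # Gauss-Jordan elimination on [A | B] recovers the source blocks.
        for col in range(n):
            piv = next(r for r in range(col, n) if A[r][col] != 0)
            A[col], A[piv] = A[piv], A[col]
            B[col], B[piv] = B[piv], B[col]
            p = A[col][col]
            A[col] = [x / p for x in A[col]]
            B[col] = [x / p for x in B[col]]
            for r in range(n):
                if r != col and A[r][col] != 0:
                    f = A[r][col]
                    A[r] = [a - f * b for a, b in zip(A[r], A[col])]
                    B[r] = [a - f * b for a, b in zip(B[r], B[col])]
        return B
    # Lose any m=2 of the 5 coded blocks and the n=3 originals still decode:
    # decode(encode([[1, 2], [3, 4], [5, 6]], 2)[:3], 3) == [[1,2],[3,4],[5,6]]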
Abstract: In wide area computing, programs
frequently execute at sites that are distant from their data. Data access
mechanisms are required that place limited functionality demands on an
application or host system yet permit high-performance implementations. To
address these requirements, we propose a data movement and access service
called Global Access to Secondary Storage (GASS). This service defines a
global name space via Uniform Resource Locators and allows applications to
access remote files via standard I/O interfaces. High performance is achieved
by incorporating default data movement strategies that are specialized for
I/O patterns common in wide area applications and by providing support for
programmer management of data movement. GASS forms part of the Globus
toolkit, a set of services for high-performance distributed computing. GASS
itself makes use of Globus services for security and communication, and other
Globus components use GASS services for executable staging and real-time
remote monitoring. Application experiences demonstrate that the library has
practical utility.
Keywords: wide-area network, parallel I/O,
pario-bib
Keywords: data grid, filter, pario-bib
Keywords: parallel I/O, disk shadowing,
reliability, disk mirroring, disk optimization, pario-bib
Comment: Goes further than bitton:shadow. Uses
simulation to verify results from that paper, which were expressions for the
expected seek distance of shadowed disks, using shortest-seek-time arm
scheduling. Problem is her assumption that arm positions stay independent, in
the face of correlating effects like writes, which move all arms to the same
place. Simulations match model only barely, and only in some cases. Anyway,
shadowed disks can improve performance for workloads with more than 60 or 70%
reads.
Keywords: parallel I/O, disk shadowing,
reliability, disk mirroring, disk optimization, pario-bib
Comment: Also TR UIC EECS 88-1 from Univ of
Illinois at Chicago. Shadowed disks are mirroring with more than 2 disks.
Writes to all disks, reads from one with shortest seek time. Acknowledges but
ignores problem posed by lo:disks. Also considers that newer disk technology
does not have linear seek time $(a+bx)$ but rather $(a+b\sqrt{x})$. Shows
that with either seek distribution the average seek time for workloads with
at least 60% reads decreases as the number of disks grows. See also
bitton:schedule.
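The read-seek benefit is easy to reproduce by simulation under the same
independent-arm assumption the model makes; a minimal Python Monte Carlo
sketch comparing the two seek-time models (the constants a and b are
invented):
    import random

    def expected_read_seek(n_disks, seek_time, trials=200_000):
        # Arm positions and the target cylinder are independent and uniform
        # on [0, 1]; a read is serviced by the arm with the shortest seek.
        total = 0.0
        for _ in range(trials):
            target = random.random()
            dist = min(abs(random.random() - target) for _ in range(n_disks))
            total += seek_time(dist)
        return total / trials

    def linear_seek(x):            # seek time a + b*x (a, b invented)
        return 2.0 + 10.0 * x

    def sqrt_seek(x):              # seek time a + b*sqrt(x) (a, b invented)
        return 2.0 + 10.0 * x ** 0.5

    # for n in (1, 2, 4, 8):
    #     print(n, expected_read_seek(n, linear_seek),
    #           expected_read_seek(n, sqrt_seek))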
Keywords: parallel I/O, file access pattern,
pario-bib
Comment: A substantial part of this
structural-analysis application was involved in I/O, moving substructures in
and out of RAM. The Maspar IO-RAM helped a lot, nearly halving the time
required. On the Cray, the SSD had an even bigger impact, perhaps 7-12 times
faster. Their main conclusion is that caching helped. Most likely this was
due to its double-buffering, since they structured the code to
read/compute/write in large ``superblocks''.
Keywords: disk array, RAID, parallel I/O,
pario-bib
Comment: Part of jin:io-book.
Abstract: There is a growing interest in using
Java as the language for developing high-performance computing applications.
To be successful in the high-performance computing domain, however, Java must
not only be able to provide high computational performance, but also
high-performance I/O. In this paper, we first examine several approaches that
attempt to provide high-performance I/O in Java (many of which are not
obvious at first glance) and evaluate their performance on two parallel
machines, the IBM SP and the SGI Origin2000. We then propose extensions to
the Java I/O library that address the deficiencies in the Java I/O API and
improve performance dramatically. The extensions add bulk (array) I/O
operations to Java, thereby removing much of the overhead currently
associated with array I/O in Java. We have implemented the extensions in two
ways: in a standard JVM using the Java Native Interface (JNI) and in a
high-performance parallel dialect of Java called Titanium. We describe the
two implementations and present performance results that demonstrate the
benefits of the proposed extensions.
Keywords: parallel I/O, Java, file system
interface, pario-bib
Abstract: There is a growing interest in using
Java as the language for developing high-performance computing applications.
To be successful in the high-performance computing domain, however, Java must
not only be able to provide high computational performance, but also
high-performance I/O. In this paper, we first examine several approaches that
attempt to provide high-performance I/O in Java (many of which are not
obvious at first glance) and evaluate their performance on two parallel
machines, the IBM SP and the SGI Origin2000. We then propose extensions to
the Java I/O library that address the deficiencies in the Java I/O API and
improve performance dramatically. The extensions add bulk (array) I/O
operations to Java, thereby removing much of the overhead currently
associated with array I/O in Java. We have implemented the extensions in two
ways: in a standard JVM using the Java Native Interface (JNI) and in a
high-performance parallel dialect of Java called Titanium. We describe the
two implementations and present performance results that demonstrate the
benefits of the proposed extensions.
Keywords: parallel I/O, java, file system
interface, pario-bib
Keywords: parallel I/O, database, disk caching,
pario-bib
Comment: More recent than copeland:bubba, and a
little more general. This gives few details, and doesn't spend much time on
the parallel I/O. Bubba does use parallel independent disks, with a
significant effort to place data on the disks, and do the work local to the
disks, to balance the load and minimize interprocessor communication. Also
they use a single-level store (i.e., memory-mapped files) to improve
performance of their I/O system, including page locking that is assisted by
the MMU. The OS has hooks for the database manager to give memory-management
policy hints.
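The single-level-store idea, addressing file-resident pages through virtual
memory rather than explicit reads and writes, can be sketched with mmap; a
minimal Python example with an invented file name and page size:
    import mmap, os

    PAGE = 4096                      # invented page size
    path = "bubba_segment.dat"       # invented file name

    # Create a small file-backed segment, then address it like memory.
    with open(path, "wb") as f:
        f.write(b"\0" * (16 * PAGE))

    with open(path, "r+b") as f:
        seg = mmap.mmap(f.fileno(), 0)            # map the whole file
        seg[3 * PAGE:3 * PAGE + 5] = b"hello"     # "write a page" = store
        print(seg[3 * PAGE:3 * PAGE + 5])         # "read a page"  = load
        seg.flush()                               # make the stores durable
        seg.close()
    os.remove(path)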
Keywords: file access pattern, parallel I/O,
database machine, pario-bib
Comment: Improvements in I/O bandwidth crucial for
supporting database machines, otherwise highly parallel DB machines are
useless (I/O bound). Two ways to do it: 1) synchronized interleaving by using
custom controller and regular disks to read/write same track on all disks,
which speeds individual accesses. 2) use very large cache (100-200M) to keep
blocks to re-use and to do prefetching. But see dewitt:pardbs.
Keywords: collective I/O, multiprocessor file
system, parallel I/O, pario-bib
Comment: bordawekar:collective was renamed
bordawekar:collective-tr, so this could be called bordawekar:collective.
Abstract: A majority of parallel applications
obtain parallelism by partitioning data over multiple processors. Accessing
distributed data structures like arrays from files often requires each
processor to make a large number of small non-contiguous data requests. This
problem can be addressed by replacing small non-contiguous requests by large
collective requests. This approach, known as Collective I/O, has been found
to work extremely well in practice. In this paper, we describe implementation
and evaluation of a collective I/O prototype in a production parallel file
system on the Intel Paragon. The prototype is implemented in the PFS
subsystem of the Intel Paragon Operating System. We evaluate collective I/O
performance by comparing it with the PFS M_RECORD and M_UNIX I/O
modes. It is observed that collective I/O provides significant performance
improvement over accesses in M_UNIX mode. However, in many cases, various
implementation overheads cause collective I/O to provide lower performance
than the M_RECORD I/O mode.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: This tech report was called
bordawekar:collective, then renamed bordawekar:collective-tr, on the
appearance of the ICS paper bordawekar:collective.
Keywords: parallel I/O, inter-processor
communication, pario-bib
Comment: bordawekar:comm-tr is nearly identical in
content. Also bordawekar:commstrat is a shorter version.
Abstract: In this paper, we show that
communication in the out-of-core distributed memory problems requires both
inter-processor communication and file I/O. Given that primary data
structures reside in files, even communication requires I/O. Thus, it is
important to optimize the I/O costs associated with a communication step. We
present three methods for performing communication in out-of-core distributed
memory problems. The first method, termed as the "out-of-core" communication
method, follows a loosely synchronous model. Computation and Communication
phases in this case are clearly separated, and communication requires
permutation of data in files. The second method, termed as
"demand-driven-in-core communication" considers only communication required
of each in-core data slab individually. The third method, termed as
"producer-driven-in-core communication" goes even one step further and tries
to identify the potential (future) use of data while it is in memory. We
describe these methods in detail and provide performance results for
out-of-core applications; namely, two-dimensional FFT and two-dimensional
elliptic solver. Finally, we discuss how "out-of-core" and "in-core"
communication methods could be used in virtual memory environments on
distributed memory machines.
Keywords: parallel I/O, inter-processor
communication, pario-bib
Comment: They compare different ways to do global
communications in out-of-core applications, involving file I/O and
communication at different times. They also comment briefly on how it would
work if it depended on virtual memory at each node.
Keywords: interprocessor communication, parallel
I/O, pario-bib
Comment: Small version of bordawekar:comm.
Abstract: It is widely acknowledged that improving
parallel I/O performance is critical for widespread adoption of high
performance computing. In this paper, we show that communication in
out-of-core distributed memory problems may require both inter-processor
communication and file I/O. Thus, in order to improve I/O performance, it is
necessary to minimize the I/O costs associated with a communication step. We
present three methods for performing communication in out-of-core distributed
memory problems. The first method called the generalized collective
communication method follows a loosely synchronous model; computation and
communication phases are clearly separated, and communication requires
permutation of data in files. The second method called the receiver-driven
in-core communication considers only communication required of each in-core
data slab individually. The third method called the owner-driven in-core
communication goes even one step further and tries to identify the potential
future use of data (by the recipients) while it is in the sender's memory. We
describe these methods in detail and present a simple heuristic to choose a
communication method from among the three methods. We then provide
performance results for two out-of-core applications, the two-dimensional FFT
code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how
the out-of-core and in-core communication methods can be used in virtual
memory environments on distributed memory machines.
Keywords: compiler, communication, out-of-core,
parallel I/O, inter-processor communication, pario-bib
Abstract: It is widely acknowledged that improving
parallel I/O performance is critical for widespread adoption of high
performance computing. In this paper, we show that communication in
out-of-core distributed memory problems may require both inter-processor
communication and file I/O. Thus, in order to improve I/O performance, it is
necessary to minimize the I/O costs associated with a communication step. We
present three methods for performing communication in out-of-core distributed
memory problems. The first method called the generalized collective
communication method follows a loosely synchronous model; computation and
communication phases are clearly separated, and communication requires
permutation of data in files. The second method called the receiver-driven
in-core communication considers only communication required of each in-core
data slab individually. The third method called the owner-driven in-core
communication goes even one step further and tries to identify the potential
future use of data (by the recipients) while it is in the sender's memory. We
describe these methods in detail and present a simple heuristic to choose a
communication method from among the three methods. We then provide
performance results for two out-of-core applications, the two-dimensional FFT
code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how
the out-of-core and in-core communication methods can be used in virtual
memory environments on distributed memory machines.
Keywords: out-of-core, compiler, communication,
distributed memory, parallel I/O, pario-bib
Comment: See also bordawekar:comm, at ICS'95.
Abstract: None.
Keywords: parallel I/O, compiler, out-of-core,
pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: This article presents a case for
compositional file systems (CFSs). The CFS is designed using the end-to-end
argument; the basic file system attributes, therefore, are independent of the
user requirements. The CFS is designed as a functionally compositional,
structurally distributed, and dynamically extendable file system. The article
also discusses the advantages and implementation alternatives for these file
systems, and outlines possible applications.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Abstract: For a high-performance parallel machine
to be a scalable system, it must also have a scalable parallel I/O system.
Recently, several commercial machines (e.g. Intel Touchstone Delta, Paragon,
CM-5, Ncube-2) have been built that provide features for parallel I/O.
However, very little is understood about the performance of these I/O
systems. This paper presents an experimental evaluation of the Intel
Touchstone Delta's Concurrent File System (CFS). The CFS utilizes the
declustering of large files across the disks to improve the I/O performance.
Data files can be read or written on the CFS using 4 access modes. We
present performance measurements for the CFS on the Touchstone Delta with 512
compute nodes and 32 I/O nodes. The study focuses on file read/write rates
for various configurations of I/O and compute nodes. The study attempts to
show the effect of access modes, buffer sizes and volume restrictions on the
system performance. The paper also shows that the performance of the CFS can
greatly vary for various data distributions commonly employed in scientific
and engineering applications.
Keywords: performance evaluation, multiprocessor
file system, parallel I/O, pario-bib
Comment: Some new numbers over
bordawekar:delta-fs-TR, but basically the same conclusions.
Keywords: performance evaluation, multiprocessor
file system, parallel I/O, pario-bib
Comment: Evaluating the Caltech Touchstone Delta
(512 nodes, 32 I/O nodes, 64 disks, 8 MB cache per I/O node). Basic
measurements of different access patterns and I/O modes. Location in network
doesn't seem to matter. Throughput is often limited by the software; at
least, the full hardware throughputs are rarely obtained. Sometimes they are
compute-node-limited, and other times they may be limited by the cache
management. There must be a way to push the bottleneck back to the disks.
Abstract: Large scale scientific applications,
such as the Grand Challenge applications, deal with very large quantities of
data. The amount of main memory in distributed memory machines is usually not
large enough to solve problems of realistic size. This limitation results in
the need for system and application software support to provide efficient
parallel I/O for out-of-core programs. This paper describes techniques for
translating out-of-core programs written in a data parallel language like HPF
to message passing node programs with explicit parallel I/O. We describe the
basic compilation model and various steps involved in the compilation. The
compilation process is explained with the help of an out-of-core matrix
multiplication program. We first discuss how an out-of-core program can be
translated by extending the method used for translating in-core programs. We
then describe how the compiler can optimize the code by estimating the I/O
costs associated with different array access patterns and selecting the
method with the least I/O cost. This optimization can reduce the amount of
I/O by as much as an order of magnitude. Performance results on the Intel
Touchstone Delta are presented and analyzed.
Keywords: parallel I/O, compiler, pario-bib
Comment: Revised as bordawekar: This is actually
fairly different from thakur:runtime. They describe the same basic compiler
technique, where arrays are distributed across processors, and each processor
has a local array file for holding data from its local partitions. Then the
I/O needed for a loop is broken into slabs, where the program proceeds as an
alternation of (read slabs, compute, write slabs). The big new thing here is
that the compiler tries different ways to form slabs (e.g., by row or by
column), estimates the number of I/Os and the amount of data moved for each
case, and chooses the case with the smallest amount of I/O. They also mention
how the choice of memory size allocated to different arrays affects the
amount of IO, but give no algorithm other than "try all the possibilities."
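The slab-selection optimization amounts to counting requests for each
candidate slab shape; a minimal Python sketch for a rows x cols array stored
row-major in the local array file (memory size and shapes are invented):
    def slab_costs(rows, cols, mem_elems):
        # Count the I/O requests needed to sweep the array once using either
        # row slabs or column slabs that fit in mem_elems elements of memory.
        rows_per_slab = max(1, mem_elems // cols)
        row_slabs = -(-rows // rows_per_slab)     # ceiling division
        row_requests = row_slabs                  # each row slab is contiguous
        cols_per_slab = max(1, mem_elems // rows)
        col_slabs = -(-cols // cols_per_slab)
        col_requests = col_slabs * rows           # one request per row, per slab
        return {"row-slab requests": row_requests,
                "column-slab requests": col_requests}
    # e.g. slab_costs(4096, 4096, 1 << 20)
    #   -> {'row-slab requests': 16, 'column-slab requests': 65536}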
Keywords: multiprocessor file system, performance
evaluation, parallel I/O, pario-bib
Comment: Part of a special issue on parallel and
distributed I/O.
Abstract: This paper presents a unified evaluation
of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar.
Our study has the following objectives: (1) To evaluate the impact of
different interacting system components, namely, architecture, operating
system, and programming model, on the overall I/O behavior and identify
possible performance bottlenecks and (2) To provide hints to the users for
achieving high out-of-box I/O throughput. We find that for the DSM machines
that are built as a cluster of SMP nodes, integrated clustering of computing
and I/O resources, both hardware and software, is not advantageous for two
reasons. First, within an SMP node, the I/O bandwidth is often restricted by
the performance of the peripheral components and cannot match the memory
bandwidth. Second, since the I/O resources are shared as a global resource,
the file-access costs become non-uniform and the I/O behavior of the entire
system, in terms of the scalability and balance, degrades. We observe that
the buffered I/O performance is determined not only by the I/O subsystem, but
also by the programming model, global-shared memory subsystem, and
data-communication mechanism. Moreover, programming-model support can be
effectively used to overcome the performance constraints created by the
architecture and operating system. For example, on the HP Exemplar, users can
achieve high I/O throughput by using features of the programming model that
balance the sharing and locality of the user buffers and file systems.
Finally, we believe that at present, the I/O subsystems are being designed in
isolation and there is a need for mending the traditional memory-oriented
design approach to address this problem.
Keywords: parallel I/O, pario-bib, workload
characterization, distributed shared memory
Keywords: data parallel, parallel I/O, pario-bib
Comment: Although this is mostly a compilers
paper, there is a little bit about parallel I/O here. They comment briefly on
how their compiler framework will help them make a compiler that can provide
advice to the file system about prefetching and cache replacement, and to
decide on the layout of scratch files to optimize locality.
Keywords: parallel I/O, pario-bib
Comment: They propose some extensions to HPF to
accommodate parallel I/O.
Abstract: This report presents implementation
details of the prototype PASSION compiler. The PASSION compiler provides
support for: (1) Accessing multidimensional in-core arrays and (2)
Out-of-core computations. The PASSION compiler takes as input an annotated
I/O intensive (either an out-of-core program or program accessing distributed
arrays from files) High Performance Fortran (HPF) program. Using hints
provided by the user, the compiler modifies the computation so as to minimize
the I/O cost and restructures the program to incorporate explicit I/O calls.
In this report, compilation of out-of-core FORALL constructs is illustrated
using representative programs. Compiler support for accessing distributed
in-core data is explained using illustrative examples and supplemented by
experimental results.
Keywords: parallel I/O, compiler, FORTRAN, HPF,
pario-bib
Comment: Currently not available on WWW. Describes
implementation details of the PASSION Compiler.
Abstract: This paper presents a unified evaluation
of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar.
Our study has the following objectives: 1) To evaluate the impact of
different interacting system components, namely, architecture, operating
system, and programming model, on the overall I/O behavior and identify
possible performance bottlenecks, and 2) To provide hints to the users for
achieving high out-of-box I/O throughput. We find that for the DSM machines
that are built as a cluster of SMP nodes, integrated clustering of computing
and I/O resources, both hardware and software, is not advantageous for two
reasons. First, within an SMP node, the I/O bandwidth is often restricted by
the performance of the peripheral components and cannot match the memory
bandwidth. Second, since the I/O resources are shared as a global resource,
the file-access costs become nonuniform and the I/O behavior of the entire
system, in terms of both scalability and balance, degrades. We observe
that the buffered I/O performance is determined not only by the I/O
subsystem, but also by the programming model, global-shared memory subsystem,
and data-communication mechanism. Moreover, programming-model support can be
used effectively to overcome the performance constraints created by the
architecture and operating system. For example, on the HP Exemplar, users can
achieve high I/O throughput by using features of the programming model that
balance the sharing and locality of the user buffers and file systems.
Finally, we believe that at present, the I/O subsystems are being designed in
isolation, and there is a need for mending the traditional memory-oriented
design approach to address this problem.
Keywords: parallel I/O, pario-bib, workload
characterization, distributed shared memory
Abstract: It is widely acknowledged in
high-performance computing circles that parallel input/output needs
substantial improvement in order to make scalable computers truly usable. We
present a data storage model that allows processors independent access to
their own data and a corresponding compilation strategy that integrates
data-parallel computation with data distribution for out-of-core problems.
Our results compare several communication methods and I/O optimizations using
two out-of-core problems, Jacobi iteration and LU factorization.
Keywords: parallel I/O, compiler, pario-bib
Keywords: compilers, parallel I/O, out-of-core
applications, pario-bib
Comment: Basically a summary of their I/O and
compilation model for out-of-core compilation of HPF programs. See also
paleczny:support.
Abstract: This thesis looks at various issues in
providing application-level software support for parallel I/O. We show that
the performance of the parallel I/O system varies greatly as a function of
data distributions. We present runtime I/O primitives for parallel languages
which allow the user to obtain a consistent performance over a wide range of
data distributions. In order to design these primitives, we study
various parameters used in the design of a parallel file system. We evaluate
the performance of the Touchstone Delta Concurrent File System and study the
effect of parameters like the number of processors, the number of disks, and
the file size on the system performance. We compute the I/O costs for common
data distributions. We propose an alternative strategy, the two-phase data
access strategy, to optimize the I/O costs connected with data distributions.
We implement runtime primitives using the two-phase access strategy and show
that these primitives not only improve I/O access rates but also let the user
obtain complex data distributions like block-block and block-cyclic.
Keywords: parallel I/O, pario-bib
Comment: This is basically a consolidation of the
other bordawekar papers, in more detail. So he covers an experimental
analysis of the touchstone delta; of the problems arising from the
direct-access model for non-conforming distributions; of the two-phase model;
and of the run-time library to support two-phase access. See also
bordawekar:reorganize, thakur:runtime, bordawekar:efficient,
thakur:out-of-core, delrosario:two-phase, bordawekar:primitives,
bordawekar:delta-fs.
Abstract: This paper describes a framework for
analyzing dataflow within an out-of-core parallel program. Dataflow
properties of FORALL statement are analyzed and a unified I/O and
communication placement framework is presented. This placement framework can
be applied to many problems, which include eliminating redundant I/O incurred
in communication. The framework is validated by applying it for optimizing
I/O and communication in out-of-core stencil problems. Experimental
performance results on an Intel Paragon show significant reduction in I/O and
communication overhead.
Keywords: parallel I/O, compiler, pario-bib
Abstract: In this paper, we describe a framework
for optimizing communication and I/O costs in out-of-core problems. We focus
on communication and I/O optimization within a FORALL construct. We show that
existing frameworks do not extend directly to out-of-core problems and can
not exploit the FORALL semantics. We present a unified framework for the
placement of I/O and communication calls and apply it for optimizing
communication for stencil applications. Using the experimental results, we
demonstrate that correct placement of I/O and communication calls can
completely eliminate extra file I/O from communication and obtain significant
performance improvement.
Keywords: parallel I/O, compiler, pario-bib
Abstract: In this paper, we show that the
performance of parallel file systems can vary greatly as a function of the
selected data distributions, and that some data distributions can not be
supported. Also, we describe how the parallel language extensions, though
simplifying the programming, do not address the performance problems found in
parallel file systems. We have devised an alternative scheme for
conducting parallel I/O - the Two-Phase Access Strategy - which guarantees
higher and more consistent performance over a wider spectrum of data
distributions. We have designed and implemented runtime primitives that make
use of the two-phase access strategy to conduct parallel I/O, and facilitate
the programming of parallel I/O operations. We describe these primitives in
detail and provide performance results which show that I/O access rates are
improved by up to several orders of magnitude. Further, we show that the
variation in performance over various data distributions is restricted to
within a factor of 2 of the best access rate.
Keywords: parallel I/O, pario-bib
Comment: Much of this is the same as
delrosario:two-phase, except for section 4 where they describe their actual
run-time library of primitives, with a little bit about how it works. It's
not clear, for example, how their meta-data structures are distributed across
the machine. They also do not describe their methods for the data
redistribution.
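The redistribution step is, at heart, an all-to-all exchange after a
conforming read; a minimal two-phase read sketch using mpi4py and numpy
(the file name, sizes, and target block-cyclic distribution are invented,
and a production library would also carry index information with the data):
    # Run as: mpiexec -n 4 python two_phase_read.py   (names invented)
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    N = 1 << 20                   # total float64 elements in the file (invented)
    chunk = N // size
    lo = rank * chunk
    # Phase 1: every process reads one large contiguous ("conforming") chunk.
    data = np.fromfile("big_array.f64", dtype=np.float64,
                       count=chunk, offset=lo * 8)

    # Phase 2: redistribute in memory to the distribution the program wants,
    # here block-cyclic with an invented block size, via an all-to-all.
    B = 1024
    owner = ((lo + np.arange(chunk)) // B) % size       # target rank per element
    outgoing = [data[owner == r] for r in range(size)]  # pieces for each rank
    incoming = comm.alltoall(outgoing)                  # object-based all-to-all
    mine = np.concatenate(incoming)                     # my block-cyclic pieces
    print(rank, mine.size)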
Keywords: parallel I/O, compilation, pario-bib
Comment: Basically they give a case study of
out-of-core matrix multiplication to emphasize that the compiler's choice of
loop ordering and matrix distribution for in-core matmult is not a very good
choice for out-of-core matmult, because it causes too much I/O. By
reorganizing the data and the loops, they get much better performance. In
this particular case there are known algorithms which they should have used.
In general they make the point that the compiler should consider several
organizations, and estimate their costs, before generating code. They don't
propose anything more sophisticated than to try all the possible
organizations.
Abstract: In this paper, we describe a technique
for optimizing communication for out-of-core distributed memory stencil
problems. In these problems, communication may require both inter-processor
communication and file I/O. We show that in certain cases, extra file I/O
incurred in communication can be completely eliminated by reordering in-core
computations. The in-core computation pattern is decided by: (1) how the
out-of-core data is distributed into in-core slabs (tiling) and (2) how the
slabs are accessed. We show that a compiler using the stencil and processor
information can choose the tiling parameters and schedule the tile accesses
so that the extra file I/O is eliminated and overall performance is improved.
Keywords: compiler, parallel I/O, pario-bib
Abstract: In this paper, we describe a technique
for optimizing communication for out-of-core distributed memory stencil
problems. In these problems, communication may require both inter-processor
communication and file I/O. We show that in certain cases, extra file I/O
incurred in communication can be completely eliminated by reordering in-core
computations. The in-core computation pattern is decided by: (1) how the
out-of-core data is distributed into in-core slabs (tiling) and (2) how the
slabs are accessed. We show that a compiler using the stencil and processor
information can choose the tiling parameters and schedule the tile accesses
so that the extra file I/O is eliminated and overall performance is improved.
Keywords: compiler, parallel I/O, pario-bib
Keywords: parallel I/O, pario-bib
Comment: Contains much of the material from
bordawekar:hpf.
Abstract: This dissertation investigates several
issues in providing compiler support for I/O intensive parallel programs. In
this dissertation, we focus on satisfying two I/O requirements, namely,
support for accessing multidimensional arrays and support for
out-of-core computations. We analyze working spaces in I/O intensive
programs and propose three execution models to be used by users or compilers
for developing efficient I/O intensive parallel programs. Different phases in
compiling out-of-core parallel programs are then described. Three different
methods for performing communication are presented and validated using
representative application templates. We illustrate that communication in
out-of-core programs may require both inter-processor communication and file
I/O. We show that using the copy-in-copy-out semantics of the HPF
FORALL construct, extra file I/O incurred in communication can be completely
eliminated by reordering in-core computations. Two different approaches for
reordering in-core computations are presented, namely, integrated tiling and
scheduling heuristic, and dataflow framework for placing communication and
I/O calls. The discussion is supplemented with experimental performance
results of representative stencil applications. Finally, an overview of the
prototype \textsf{PASSION} (Parallel And Scalable Software for I/O) compiler
is presented. This compiler takes an annotated out-of-core High Performance
Fortran (HPF) program as input and generates the corresponding
node+message-passing program with calls to the parallel I/O runtime library.
We illustrate various functionalities of the compiler using example programs
and supplement them by experimental results.
Keywords: parallel I/O, compiler, HPF, pario-bib
Keywords: parallel I/O, distributed memory,
pario-bib
Comment: In a sense, this is about a two-phase
technique for network I/O. They consider the problem of feeding a fast
network interface (HIPPI) from a distributed-memory parallel machine (iWARP)
in which the individual internal links are slower than the external network.
So they get the processors to cooperate to reshuffle the data into a
canonical layout that is convenient to send to the gateway node, and from
there onto the external network.
Keywords: object-based storage, distributed file
system, parallel file system, pario-bib
Comment: Describes an open-source project to
develop an object-based file system for clusters. Related to the NASD project
at CMU (http://www.pdl.cs.cmu.edu/NASD/).
Keywords: hypercube, parallel I/O, Intel,
pario-bib
Comment: Some measurements and simulations of
early CFS performance. Looks terrible, but they disclaim that it is a beta
version of the first CFS. They determined that the disks are the bottleneck.
But this may just imply that they need more disks. Their parallel synthetic
applications had each process read a separate file. CFS had ridiculous
traffic overhead. Again, this was beta CFS.
Keywords: parallel I/O, disk caching, disk
architecture, pario-bib
Comment: Some new DASD products with caches
overlap cache hits with prefetch of remainder of track into cache. They use
analytical model to evaluate performance of these. They find performance
improvements of 5-15 percent under their assumptions.
Keywords: supercomputing, fortran, multiprocessor
file system interface, pario-bib
Comment: Describing their way of writing arrays to
files so that they are written in a fast, parallel way, and so that (if read
in the same distribution) they can be read back fast and in parallel. Normal
read and write force the standard ordering, but cread and cwrite use a
compiler- and runtime-selected ordering, which is stored in the file so it
can be used when
rereading. Good for temp files.
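A minimal serial Python sketch of the file-format idea (the header layout is
invented): blocks are written in whatever order is convenient and the
ordering is recorded in the file, so a matching cread can hand each block
back without forcing the standard ordering:
    import json, struct
    import numpy as np

    def cwrite(path, pieces):
        # pieces: {block id: numpy array}; blocks are written in whatever
        # order is convenient, and that order is recorded in a small header.
        order = list(pieces)
        header = json.dumps({"order": order}).encode()
        with open(path, "wb") as f:
            f.write(struct.pack("<I", len(header)))
            f.write(header)
            for b in order:
                f.write(pieces[b].tobytes())

    def cread(path, block_elems, dtype=np.float64):
        # The stored ordering says which block is which, so a reader using
        # the same distribution needs no reordering pass.
        with open(path, "rb") as f:
            hlen = struct.unpack("<I", f.read(4))[0]
            order = json.loads(f.read(hlen))["order"]
            nbytes = block_elems * np.dtype(dtype).itemsize
            return {b: np.frombuffer(f.read(nbytes), dtype=dtype)
                    for b in order}

    # cwrite("temp.arr", {2: np.arange(4.), 0: np.ones(4), 1: np.zeros(4)})
    # cread("temp.arr", 4)   # returns blocks keyed by block id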
Abstract: Recently several projects have started
to implement large-scale high-performance computing on "computational grids"
composed of heterogeneous and geographically distributed systems of
computers, networks, and storage devices that collectively act as a single
"virtual supercomuter". One of the great challenges for this environment is
to provide appropriate high-level programming models. High Performance
Fortran (HPF) is a language of choice for development of data parallel
components of Grid applications. Another challenge is to provide efficient
access to data that is distributed across local and remote Grid resources. In
this paper, constructs to specify parallel input and output (I/O) operations
on multidimensional arrays on the Grid in the context of HPF are proposed.
The paper also presents implementation concepts that are based on the HPF
compiler VFC, the parallel I/O runtime system Panda, Internet, and Grid
technologies. Preliminary experimental performance results are discussed in
the context of a real application example.
Keywords: parallel I/O, Fortran, HPF,
data-parallel, computational grid, pario-bib
Abstract: For an increasing number of data
intensive scientific applications, parallel I/O concepts are a major
performance issue. Tackling this issue, we provide an outline of an
input/output system designed for highly efficient, scalable and conveniently
usable parallel I/O on distributed memory systems. The main focus of this
paper is the parallel I/O runtime system support provided for
software-generated programs produced by parallelizing compilers in the
context of High Performance FORTRAN efforts. Specifically, our design is
presented in the context of the Vienna Fortran Compilation System.
Keywords: compiler transformations, runtime
support, parallel I/O, prefetching, pario-bib
Keywords: parallel I/O, high performance mass
storage system, high performance languages, compilation techniques, data
administration, pario-bib
Keywords: compiler transformations, runtime
support, declustering, parallel I/O, pario-bib
Comment: They describe some extensions to Vienna
Fortran that support parallel I/O, and how they plan to extend the compiler
and run-time system to help. They are somewhat short on details, however. The
basic idea is that file declustering is based on hints from the compiler or
programmer about how the file will be used, eg, as a matrix distributed in
thus-and-so way.
Abstract: Large scale irregular applications
involve data arrays and other data structures that are too large to fit in
main memory and hence reside on disks; such applications are called
out-of-core applications. This paper presents techniques for implementing
this kind of application. In particular we present a design for a runtime
system to efficiently support parallel execution of irregular out-of-core
codes on distributed-memory systems. Furthermore, we describe the appropriate
program transformations required to reduce the I/O overheads for staging data
as well as for communication while maintaining load balance. The proposed
techniques can be used by a parallelizing compiler or by users writing
programs in node + message passing style. We have done a preliminary
implementation of the techniques presented here. We introduce experimental
results from a template CFD code to demonstrate the efficacy of the presented
techniques.
Keywords: parallel I/O, out of core, compiler,
library, pario-bib
Comment: The authors present techniques for
implementing large scale irregular out-of-core applications. The techniques
they describe can be used by a parallel compiler (e.g., HPF and its
extensions) or by users using message passing. The objectives of the proposed
techniques are ``to minimize I/O accesses in all steps while maintaining
load balance and minimal communication''. They demonstrate the effectiveness
of their techniques by showing results from a Computational Fluid Dynamics
(CFD) code.
Keywords: parallel I/O, out of core, irregular
applications, compiler, pario-bib
Abstract: Input and output (I/O) is a major
performance bottleneck for large-scale scientific applications running on
parallel platforms. For example, it is not uncommon that performance of
carefully tuned parallel programs can slow dramatically when they read or
write files. This is because many parallel applications need to access large
amounts of data, and although great advances have been made in the CPU and
communication performance of parallel machines, similar advances have not
been made in their I/O performance. The densities and capacities of disks
have increased significantly, but improvement in performance of individual
disks has not followed the same pace. For parallel computers to be truly
usable for solving real, large-scale problems, the I/O performance must be
scalable and balanced with respect to the CPU and communication performance
of the system. Parallel I/O techniques can help to solve this problem by
creating multiple data paths between memory and disks. However, simply adding
disk drives to an I/O system without considering the overall software design
will improve performance only marginally.
Keywords: pario-bib, parallel I/O
Keywords: distributed file system, pario-bib
Comment: See broom:acacia-tr. See also broom:impl,
lautenbach:pfs, mutisya:cache, and broom:cap.
Keywords: distributed file system, pario-bib
Comment: This paper is not specifically about
parallel I/O, but the file system will be used in the AP-1000 multiprocessor.
Acacia is a file server that is optimized for synchronous writes, like those
used in stateless protocols (eg, NFS). It writes inodes in blocks in any free
location that is close to the current head position, using indirect inode
blocks to track those. Indirect blocks are in turn written anywhere
convenient, and their positions are tracked by the superblock. There is one
slot in each cylinder reserved for the superblock, which is timestamped. They
get good performance but claim to need a better implementation, and a faster
allocation algorithm. No indication of effect on read performance.
Keywords: distributed file system, multiprocessor
file system, pario-bib
Comment: See also broom:acacia, broom:impl,
lautenbach:pfs, and mutisya:cache. This describes the semantic model for
their file system. Modelled a lot after Amoeba, they have capabilities that
represent immutable files. There are create, destroy, read, and write
operations, but the read and write can affect only part of the file, if
desired. They also have an atomic ``copy'' operation, which creates a
snapshot of the current state of the file. They also have ``spawn'' and
``merge'' operations, which are essentially begin and end a transaction, a
set of changes that are atomically merged into the file later. These seem to
be addressing issues of concurrency more than of parallelism. They also
discuss implementation somewhat, mentioning the use of distributed caches and
log-structured disk layout. Prototype in Linda (!).
Keywords: distributed file system, multiprocessor
file system, pario-bib
Comment: See also broom:acacia, lautenbach:pfs,
mutisya:cache, and broom:cap. This paper is a very sketchy overview of those;
it is better to read them.
Abstract: To ameliorate the need to spend
significant programmer time modifying parallel programs to achieve
high-performance, while maintaining compact, comprehensible source codes, the
paper advocates the use of telescoping language technology to automatically
apply, during the normal compilation process, high-level performance
enhancing transformations to applications using a high-level domain-specific
I/O library. We believe that this approach will be more acceptable to
application developers than new language extensions, but will be just as
amenable to optimization by advanced compilers, effectively making it a
domain-specific language extension for I/O. The paper describes a
domain-specific I/O library for irregular block-structured applications based
on the KeLP library, describes high-level transformations of the library
primitives for improving performance, and describes how a high-level
domain-specific optimizer for applying these transformations could be
constructed using the telescoping languages framework.
Keywords: parallel I/O, domain-specific I/O
library, scientific computing, astronomy, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: They evaluate the performance of Acacia
with some simple synthetic benchmarks. Performance limited by implementation
problems in the sequential file system. Otherwise no real surprises.
Abstract: Benchmarks have historically played a
key role in guiding the progress of computer science systems research and
development, but have traditionally neglected the areas of availability,
maintainability, and evolutionary growth, areas that have recently become
critically important in high-end system design. As a first step in addressing
this deficiency, we introduce a general methodology for benchmarking the
availability of computer systems. Our methodology uses fault injection to
provoke situations where availability may be compromised, leverages existing
performance benchmarks for workload generation and data collection, and can
produce results either in detail-rich graphical presentations or in distilled
numerical summaries. We apply the methodology to measure the availability of
the software RAID systems shipped with Linux, Solaris 7 Server, and Windows
2000 Server, and find that the methodology is powerful enough not only to
quantify the impact of various failure conditions on the availability of
these systems, but also to unearth their design philosophies with respect to
transient errors and recovery policy.
Keywords: RAID, disk array, parallel I/O,
pario-bib
Keywords: parallel I/O, disk caching, database,
pario-bib
Comment: A fancy interconnection from procs to I/O
processors, intended mostly for DB applications, that uses cache at I/O end
and a switch with smarts. Cache is associative. Switch helps out in sort and
join operations.
Keywords: parallel programming, parallel I/O,
pario-bib
Comment: An overview of the CHIMP message-passing
library and the PUL set of libraries. Key design goal is portability; they
run on many systems. PUL includes PUL-GF, which supports parallel access to
files (see chapple:pulgf, chapple:pulgf-adv, and chapple:pario). Other PUL
libraries support grids and meshes, global communications, and task farms.
Contact pul@epcc.ed.ac.uk.
Abstract: The concept of block-cyclic order
elimination can be applied to out-of-core $LU$ and $QR$ matrix factorizations
on distributed memory architectures equipped with a parallel I/O system. This
elimination scheme provides load balanced computation in both the factor and
solve phases and further optimizes the use of the network bandwidth to
perform I/O operations. Stability of LU factorization is enforced by full
column pivoting. Performance results are presented for the Connection Machine
system CM-5.
Keywords: parallel I/O, linear algebra,
out-of-core, pario-bib
Comment: Short, not many details. Performance
results show about 3.5 Gflops for all problem sizes, both in-core on small N
and out-of-core on large N.
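For reference, the 1-D block-cyclic assignment of column blocks to processors
that such an elimination scheme relies on can be written in a few lines. The
sketch below is purely illustrative Python, not code from the paper; the block
and processor counts are arbitrary examples.

    # 1-D block-cyclic mapping of column blocks to P processors (illustrative only).
    def owner(block_index, num_procs):
        """Processor that owns a given column block under block-cyclic distribution."""
        return block_index % num_procs

    def my_blocks(rank, num_blocks, num_procs):
        """Column blocks held by 'rank'; work stays balanced as leading blocks are eliminated."""
        return [b for b in range(num_blocks) if owner(b, num_procs) == rank]

    # Example: 10 column blocks on 4 processors.
    for r in range(4):
        print(r, my_blocks(r, 10, 4))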
Keywords: parallel I/O, disk striping, distributed
file system, pario-bib
Comment: See cabrera:swift, cabrera:swift2.
Describes the performance of a Swift prototype and simulation results. They
stripe data over multiple disk servers (here SPARC SLC with local disk), and
access it from a SPARC2 client. Their prototype gets nearly linear speedup
for reads and asynchronous writes; synchronous writes are slower. They hit
the limit of the Ethernet and/or the client processor with three disk
servers. Adding another Ethernet allowed them to go higher. Simulation shows
good scaling. Seems like a smarter implementation would help, as would
special-purpose parity-computation hardware. Good arguments for use of PID
instead of RAID, to avoid a centralized controller that is both a bottleneck
and a single point of failure.
Keywords: parallel I/O, disk striping, distributed
file system, pario-bib
Keywords: striping, parallel I/O, distributed
system, pario-bib
Comment: See cabrera:swift2, cabrera:swift,
cabrera:pario. Not much new here. Simulates higher-performance architectures.
Shows reasonable scalability. Counts 5 inst/byte for parity computation.
Keywords: parallel I/O, disk striping, distributed
file system, multimedia, pario-bib
Comment: See cabrera:swift2. A brief outline of a
design for a high-performance storage system, designed for storing and
retrieving large objects like color video or visualization data at very high
speed. They distribute data over several ``storage agents'', which are some
form of disk or RAID. They are all connected by a high-speed network. A
``storage manager'' decides where to spread each file and what kind of
reliability mechanism to use. The user provides preallocation info such as size,
reliability level, data rate requirements, and so forth.
Keywords: parallel I/O, disk striping, distributed
file system, multimedia, pario-bib
Comment: See also cabrera:swift. More detail than
the other paper. Experimental results from a prototype that stripes files
across a distributed file system. Gets almost linear speedup in certain
cases. Much better than NFS. Simulation to extend it to larger systems.
Abstract: This paper describes an implementation
of MPI-IO using a new parallel file system, called Expand (Expandable
Parallel File System), that is based on NFS servers. Expand combines
multiple NFS servers to create a distributed partition where files are
declustered. Expand requires no changes to the NFS server and uses RPC
operations to provide parallel access to the same file. Expand is also
independent of the clients, because all operations are implemented using RPC
and NFS protocol. The paper describes the design, the implementation and the
evaluation of Expand with MPI-IO. This evaluation has been made in Linux
clusters and compares Expand and PVFS.
Keywords: parallel I/O, multiprocessor file
system, NFS, pario-bib
Abstract: Applications that explore, query,
analyze, visualize, and, in general, process very large scale data sets are
known as Data Intensive Applications. Large scale data intensive computing
plays an increasingly important role in many scientific activities and
commercial applications, whether it involves data mining of commercial
transactions, experimental data analysis and visualization, or intensive
simulation such as climate modeling. By combining high performance
computation, very large data storage, high bandwidth access, and high-speed
local and wide area networking, data intensive computing enhances the
technical capabilities and usefulness of most systems. The integration of
parallel and distributed computational environments will produce major
improvements in performance for both computing intensive and data intensive
applications in the future. The purpose of this introductory article is to
provide an overview of the main issues in parallel data intensive computing
in scientific and commercial applications and to encourage the reader to go
into the more in-depth articles later in this special issue.
Keywords: parallel application, parallel I/O,
pario-bib
Keywords: parallel I/O, RAID, pario-bib
Comment: See cao:tickertaip-tr2.
Keywords: parallel I/O, RAID, pario-bib
Comment: Superseded by cao:tickertaip-tr2 and
cao:jtickertaip.
Keywords: parallel I/O, RAID, pario-bib
Comment: A parallelized RAID architecture that
distributes the RAID controller operations across several worker nodes.
Multiple hosts can connect to different workers, allowing multiple paths into
the array. The workers then communicate on their own fast interconnect to
accomplish the requests, distributing parity computations across multiple
workers. They get much better performance and reliability than plain RAID.
They built a prototype and a performance simulator. Two-phase commit was
needed for request atomicity, and a request sequencer was needed for
serialization. Also found it was good to give the whole request info to all
workers and to let them figure out what to do and when. Superseded by
cao:tickertaip-tr2 and cao:tickertaip.
Keywords: parallel I/O, RAID, pario-bib
Comment: Revised version of cao:tickertaip,
actually: ``It's the ISCA paper with some text edits plus some new results on
what happens if you turn disk request-scheduling on. It's been sent to
TOCS.'' Thus it supersedes both cao:tickertaip-tr and cao:tickertaip.
Eventually published as cao:jtickertaip.
Abstract: Caching has been intensively used in
memory and traditional file systems to improve system performance. However,
the use of caching in parallel file systems and I/O libraries has been
limited to I/O nodes to avoid cache coherence problems. In this paper, we
specify an adaptive cache coherence protocol very suitable for parallel file
systems and parallel I/O libraries. This model exploits the use of caching,
both at processing and I/O nodes, and provides performance-enhancing mechanisms
such as aggressive prefetching and delayed-write techniques. The cache coherence
problem is solved by using a dynamic scheme of cache coherence protocols with
different sizes and shapes of granularity. The proposed model is well suited
to parallel I/O interfaces such as MPI-IO. Performance results,
obtained on an IBM SP2, are presented to demonstrate the advantages offered
by the cache management methods proposed.
Keywords: parallel file system, caching, cache
coherence, adaptive caching, protocol specification, pario-bib
Keywords: persistent systems, database, parallel
I/O, object-oriented, pario-bib
Comment: SHORE is a persistent object database
system. It is intended for parallel or distributed systems, and attempts to
combine both DB and file system features. Everything in the database is a
typed object, in that there is a registered interface object that defines
this type, including the basic data types of elements of the object, and
methods that manipulate the object. Every object has an OID, and objects can
refer to other objects with the OID. But they also support a Unix-like
namespace, in which names refer to objects by giving the OID. They also
have a unix-compatibility library that provides access to many objects
through the unix file interface. Every node has a SHORE server, and
applications talk to their local server for all their needs. The local server
talks to other servers as needed. The servers are also responsible for
caching pages and managing locks and transactions.
Abstract: As Linux clusters have matured as
platforms for low-cost, high-performance parallel computing, software
packages to provide many key services have emerged, especially in areas such
as message passing and networking. One area devoid of support, however, has
been parallel file systems, which are critical for high-performance I/O on
such clusters. We have developed a parallel file system for Linux clusters,
called the Parallel Virtual File System (PVFS). PVFS is intended both as a
high-performance parallel file system that anyone can download and use and as
a tool for pursuing further research in parallel I/O and parallel file
systems for Linux clusters. In this paper, we describe the design and
implementation of PVFS and present performance results on the Chiba City
cluster at Argonne. We provide performance results for a workload of
concurrent reads and writes for various numbers of compute nodes, I/O nodes,
and I/O request sizes. We also present performance results for MPI-IO on
PVFS, both for a concurrent read/write workload and for the BTIO benchmark.
We compare the I/O performance when using a Myrinet network versus a
fast-ethernet network for I/O-related communication in PVFS. We obtained read
and write bandwidths as high as 700 Mbytes/sec with Myrinet and
225 Mbytes/sec with fast ethernet.
Keywords: parallel I/O, parallel file system,
cluster file system, Linux, pario-bib
Comment: won the Best Paper Award.
Abstract: This document briefly describes the
components of the Cache Coherent File System (CCFS) source code. CCFS has
three main components: Client File Server (CLFS), Local File Server (LFS),
Concurrent Disk System (CDS). The main modules and functions of each
component are described here. Special emphasis has been put on interfaces and
data structures.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Abstract: We present two designs, namely,
"collective I/O" and "pipelined collective I/O", of a runtime library for
irregular applications based on the two-phase collective I/O technique. We
also present the optimization of both models by using chunking and
compression mechanisms. In the first scheme, all processors participate in
compressions and I/O at the same time, making scheduling of I/O requests
simpler but creating a possibility of contention at the I/O nodes. In the
second approach, processors are grouped into several groups, overlapping
communication, compression, and I/O to reduce I/O contention dynamically.
Finally, evaluation results are shown that demonstrate that we can obtain
significantly higher I/O performance than has been possible so far.
Keywords: PASSION, parallel I/O, compression,
collective I/O, two-phase I/O, performance evaluation, pario-bib
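The two-phase collective I/O technique that both designs build on first
redistributes data among processes so that each ends up owning a contiguous
region of the file, and only then performs the actual I/O. The following is a
minimal, hypothetical mpi4py sketch of that pattern; the array layout, file
name, and even partitioning are assumptions, not details taken from the paper.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each process starts with a strided slice of a conceptual global array of
    # length size*size: process r holds global elements r, r+size, r+2*size, ...
    local = np.arange(rank, size * size, size, dtype='i4')

    # Phase 1 (communication): redistribute so process p ends up with the
    # contiguous block [p*size, (p+1)*size).  With this layout, element j of
    # 'local' belongs to process j, so a plain all-to-all does the exchange.
    contiguous = np.empty(size, dtype='i4')
    comm.Alltoall(local, contiguous)

    # Phase 2 (I/O): every process now writes one contiguous region collectively.
    fh = MPI.File.Open(comm, 'twophase.out', MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Write_at_all(rank * size * contiguous.itemsize, contiguous)
    fh.Close()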
Abstract: In this paper we present an experimental
evaluation of COMPASSION, a runtime system for irregular applications based
on collective I/O techniques. It provides a "Collective I/O" model, enhanced
with "Pipelined" operations and compression. All processors participate in
the I/O simultaneously, alone or grouped, making scheduling of I/O requests
simpler and providing support for contention management. In-memory
compression mechanisms reduce the total execution time by diminishing the
amount of I/O requested and the I/O contention. Our experiments, executed on
an Intel Paragon and on the ASCI/Red teraflops machine, demonstrate that
COMPASSION can obtain significantly higher I/O performance than has been
possible so far.
Keywords: PASSION, parallel I/O, compression,
collective I/O, pario-bib
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Abstract: This paper presents a Parallel Disk
System (PDS) for general purpose multiprocessors, which provides support for
conventional file systems and databases, as well as direct access for
applications requiring high performance mass storage. We present a systematic
method to characterize a parallel I/O system, using it to evaluate PDS and to
identify an optimal PDS configuration. Several devices (single disk, Raid3
and Raid5), and different configurations of I/O nodes, each one with a
different type of device, have been simulated. Throughput and I/O rate of
each configuration have been obtained for the former configurations and
different types of workloads (database, general purpose and scientific
applications).
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Abstract: This document describes the detailed
design of the LFS, one of the components of the Cache Coherent File System
(CCFS). CCFS has three main components: Client File Server (CLFS), Local File
Server (LFS), Concurrent Disk System (CDS). The Local File Servers are
located on each disk node, to develop file server functions in a per node
basis. The LFS will interact with the Concurrent Disk System (CDS) to execute
real input/output and to manage the disk system, partitions, distributed
partitions, etc. The LFS includes general file system services and
specialized services, and it will be responsible for maintaining cache
consistency, distributing accesses to other servers, controlling partition
information, etc.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Abstract: This paper gives an overview of the I/O
data mapping mechanisms of ParFiSys, covering all levels of the hierarchy.
Grouped management and parallelization are presented as relevant features.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Keywords: parallel I/O, I/O architecture,
pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Keywords: parallel I/O, benchmark, pario-bib
Comment: Specs for three scalable-I/O benchmarks
to be used for evaluating I/O for multiprocessors. One measures application
I/O by mixing I/O and computation, one measures max disk I/O by reading and
writing 80% of the total RAM, and the last one is for sending that
data from the file system, through the network, and back. See fineberg:nht1.
Abstract: To demonstrate the flexibility of the
Galley parallel file system and to analyze the efficiency and flexibility of
the Vesta parallel file system interface, we implemented Vesta's
application-programming interface on top of Galley. We implemented the Vesta
interface using Galley's file-access methods, whose design arose from
extensive testing and characterization of the I/O requirements of scientific
applications for high-performance multiprocessors. We used a parallel CPU,
parallel I/O, out-of-core matrix-multiplication application to test the Vesta
interface in both its ability to specify data access patterns and in its
run-time efficiency. In spite of its powerful ability to specify the
distribution of regular, non-overlapping data access patterns across disks,
we found that the Vesta interface has some significant limitations. We
discuss these limitations in detail in the paper, along with the performance
results.
Keywords: parallel I/O, multiprocessor file
system, pario-bib, dfk
Comment: See also
nils/galley.html
Abstract: With a view to improving the performance
and the fault tolerance of mass storage units, this paper concentrates on the
architectural issues of parallelizing I/O access in a disk array system by
means of the definition of a new, particularly flexible architecture, called
Partial Dynamic Declustering, which is fault-tolerant and offers higher
levels of performance and reliability than the solutions normally used. A
fast distributed algorithm based on a dynamic structure and usable for the
implementation of an efficient I/O subsystem manager is proposed. Particular
attention is also paid to the definition of analytical models based on
Stochastic Reward Petri nets in order to analyze the performance and
reliability of the system proposed.
Keywords: parallel I/O, disk array, pario-bib
Abstract: We concentrate on the architectural
issues of parallelizing I/O access in a disk array system by means of
definition of a new, particularly flexible architecture, called partial
dynamic declustering, which is fault-tolerant and offers higher levels of
performance and reliability than the solutions normally used. A simulation
analysis highlights the efficiency of the proposed solution in balancing the
file system workload and demonstrates its validity in both cases of
unbalanced loads and expansion of the system. Particular attention is also
paid to the definition of analytical models, based on stochastic reward nets,
in order to analyze the performance and reliability of the system. The
response time distribution function is evaluated and a specific performance
analysis with varying degrees of declustering and workload is carried out.
Keywords: parallel I/O, disk array, pario-bib
Abstract: The introduction of multiprocessor
architectures into computer systems has further increased the gap between
processing times and access times to mass memories, thus making the processes
more and more I/O-bound. To provide higher performance levels (both transfer
rate and I/O rate), disk array technology is based on the use of a number of
logically interconnected disks of a small size, in order to replace disks
which have a large capacity but are very expensive. With a view to improving
the performance and fault tolerance of the mass storage units, this paper
concentrates on the architectural issues of parallelizing I/O access in a
disk array system by means of definition of a new, particularly flexible
architecture, called Partial Dynamic Declustering, which is fault-tolerant
and offers higher levels of performance and reliability than the solutions
normally used. A fast distributed algorithm based on a dynamic structure and
usable for the implementation of an efficient I/O subsystem manager is
proposed and evaluated by a simulative analysis. A specific study also
characterizes the system's performance with varying degrees of declustering
and workload types (from the transactional to the scientific type). The
results obtained allow us to obtain the optimal configuration of the system
(number of disks per group) which will ensure the desired response time
values for varying workloads.
Keywords: parallel I/O, pario-bib
Abstract: Clusters of workstations become more and
more popular to power data server applications such as large scale Web sites
or e-Commerce applications. There has been much research on scaling the front
tiers (web servers and application servers) using clusters, but databases
usually remain on large dedicated SMP machines. In this paper, we focus on
the database tier using clusters of commodity hardware. Our approach consists
of studying different replication strategies to achieve various degrees of
performance and fault tolerance. Redundant Array of Inexpensive Databases
(RAIDb) is to databases what RAID is to disks. In this paper, we focus on
RAIDb-1 that offers full replication and RAIDb-2 that introduces partial
replication, in which the user can define the degree of replication of each
database table. We present a Java implementation of RAIDb called Clustered
JDBC or C-JDBC. C-JDBC achieves both database performance scalability and
high availability at the middleware level without changing existing
applications. We show, using the TPC-W benchmark, that partial replication
(RAIDb-2) can offer better performance scalability (up to 25%) than full
replication by allowing fine-grain control on replication. Distributing and
restricting the replication of frequently written tables to a small set of
backends reduces I/O usage and improves CPU utilization of each cluster node.
Keywords: replication strategies, RAIDb, database,
pario-bib
Abstract: The paper considers the problem of
parallel external sorting in the context of a form of heterogeneous clusters.
We introduce two algorithms and compare them to another one that we have
previously developed. Since most common sort algorithms assume high-speed
random access to all intermediate memory, they are unsuitable if the values
to be sorted don't fit in main memory. This is the case for cluster computing
platforms which are made of standard, cheap and scarce components. For that
class of computing resources a good use of I/O operations compatible with the
requirements of load balancing and computational complexity is the key to
success. We explore three techniques and show how they can be deployed for
clusters with processor performances related by a multiplicative factor. We
validate the approaches in showing experimental results for the load
balancing factor.
Keywords: out-of-core, sorting, parallel I/O, load
balancing, data distribution, pario-app, pario-bib
Abstract: We present a new computing approach for
the parallelization on message-passing computer architectures of the DNAml
algorithm, one of the most powerful tools available for constructing
phylogenetic trees from DNA sequences. An analysis of the data dependencies
of the method gave little chance of developing an efficient parallel approach.
However, a careful run-time analysis of the behaviour of the algorithm
allowed us to propose a very efficient parallel implementation based on the
combination of advanced dynamic scheduling strategies, speculative
run-time execution decisions and I/O buffering. In this work, we discuss
specific Parallel Virtual Machine (PVM)-based implementations for a cluster
of workstations and for Distributed Memory multiprocessors, with high
performance results. The code can be obtained from our public-domain sites.
Keywords: parallel computers, run-time analysis,
phylogenetic trees, DNAml program, source code, parallel I/O, pario-bib
Comment: They discuss the parallelization on
message-passing computers of the {DNA}ml algorithm, a tool used to construct
phylogenetic trees from {DNA} sequences. By performing a run-time analysis of
the behavior of the algorithm they came up with an efficient parallel
implementation based on dynamic scheduling strategies, speculative run-time
execution decisions and I/O buffering. They use I/O buffering (prefetching)
to fetch tasks that need to be processed. The parallel code was written in C
using PVM for message passing and is available via anonymous ftp at
ftp.ac.uma.es.
Keywords: object-based storage, distributed file
system, parallel file system, pario-bib
Comment: Describes an open-source project to
develop an object-based file system for clusters. Related to the NASD project
at CMU (http://www.pdl.cs.cmu.edu/NASD/).
Abstract: Because many scientific applications
require large data processing, the importance of parallel I/O has been
increasingly recognized. For collective I/O, one of the considerable features
of parallel I/O, we suggest the subgroup method. It is the way of using
collective I/O of MPI effectively in terms of application programs. From the
experimental results, we could conclude that the subgroup method for
collective I/O is more efficient than plain collective I/O.
Keywords: collective I/O, MPI subgroup, pario-bib
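The abstract does not spell the subgroup method out, but one common way to
realize such a scheme with MPI-IO is to split the world communicator into
subgroups and let each subgroup perform its own collective write. The mpi4py
sketch below is a hedged illustration under those assumptions; the group
count, the per-group output files, and the data layout are all hypothetical
rather than taken from the paper.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    ngroups = 2                        # assumed number of subgroups
    color = rank % ngroups             # simple round-robin grouping
    sub = comm.Split(color, rank)      # one subcommunicator per subgroup

    local = np.full(1024, rank, dtype='i4')

    # Each subgroup performs its own collective write, here to a per-group
    # file; ranks within a subgroup write disjoint, contiguous regions.
    fh = MPI.File.Open(sub, 'subgroup.%d.out' % color,
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Write_at_all(sub.Get_rank() * local.nbytes, local)
    fh.Close()
    sub.Free()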
Keywords: parallel I/O, disk array, pario-bib,
RAID
Comment: A framework for evaluating the
reliability of RAIDs. They consider failure and repair rates that depend on
the workload.
Abstract: Due to the growing popularity of
emerging applications such as digital libraries, Video-on-Demand, distance
learning, and Internet World-Wide Web, multimedia servers with a large
capacity and high performance storage subsystem are in high demand. Serial
storage interfaces are emerging technologies designed to improve the
performance of such storage subsystems. They provide high bandwidth, fault
tolerance, fair bandwidth sharing and long distance connection capability.
All of these issues are critical in designing a scalable and high performance
storage subsystem. Some of the serial storage interfaces provide the spatial
reuse feature which allows multiple concurrent transmissions. That is,
multiple hosts can access disks concurrently with full link bandwidth if
their access paths are disjoint. Spatial reuse provides a way to build a
storage subsystem whose aggregate bandwidth may be scaled up with the number
of hosts. However, it is not clear how much the performance of a storage
subsystem could be improved by the spatial reuse with different
configurations and traffic scenarios. Both limitation and capability of this
scalability need to be investigated. To understand their fundamental
performance characteristics, we derive an analytic model for the serial
storage interfaces with the spatial reuse feature. Based on this model, we
investigate the maximum aggregate throughput from different system
configurations and load distributions. We show how the number of disks needed
to saturate a loop varies with different numbers of hosts and different load
scenarios. We also show how the load balancing by uniformly distributing the
load to all the disks on a loop may incur high overhead. This is because the
accesses to far away disks need to go through many links and consume the
bandwidth of each link they go through. The results show the achievable
throughput may be reduced by more than half in some cases.
Keywords: I/O interface, I/O network, I/O
architecture, parallel I/O, pario-bib
Abstract: There are two major challenges for a
high-performance remote-sensing database. First, it must provide low-latency
retrieval of very large volumes of spatio-temporal data. This requires
effective declustering and placement of a multi-dimensional dataset onto a
large disk farm. Second, the order of magnitude reduction in data-size due to
post-processing makes it imperative, from a performance perspective, that the
postprocessing be done on the machine that holds the data. This requires
careful coordination of computation and data retrieval. This paper describes
the design, implementation and evaluation of Titan, a parallel
shared-nothing database designed for handling remote-sensing data. The
computational platform for Titan is a 16-processor IBM SP-2 with four fast
disks attached to each processor. Titan is currently operational and contains
about 24 GB of AVHRR data from the NOAA-7 satellite. The experimental results
show that Titan provides good performance for global queries and interactive
response times for local queries.
Keywords: parallel databases, satellite imagery,
remote sensing, parallel I/O, pario-bib
Keywords: parallel I/O, parallel file system,
pario-bib
Comment: A more detailed spec of the datamesh
architecture, specifying components and operations. It is a block server
where blocks are associatively addressed by tags. Some search operations are
supported, as are atomic tag-changing operations. See also cao:tickertaip,
wilkes:datamesh1, wilkes:datamesh, wilkes:houses, wilkes:lessons.
Keywords: parallel I/O, pario-bib
Comment: See also bruce:chimp, chapple:pulgf, and
chapple:pulgf-adv, for general information on CHIMP and PUL-GF. This document
is an exploration of the potential ways to parallelize the underlying I/O
support for the PUL-GF interface. They reason about tradeoffs in the number
of servers, disks, and clients, but (as they note) without any performance
evaluation to back it up. In particular, they argue that there should be one
partition per disk, one server per disk, and probably one client to many
servers, or many clients to many servers. A key assumption is that a
traditional serial file system is the home location for files, and that files
are ``converted'' into parallel files (or vice versa) by replicating or
distributing them. Applications could choose the number of servers (and hence
disks) for each file. Hints could be provided about many things. Interesting
idea to allow user hooks for cache prefetch and writeback functions. Support
for variable-length records (``atoms'') is a key component. Segments of a
file with different formats, e.g., a header and a matrix, may be separated
into different components when the file is distributed into parallel form.
See chapple:pulpf for info on the eventual realization of these ideas.
Keywords: parallel I/O, pario-bib
Comment: PUL is a set of libraries that run on top
of the CHIMP portable message-passing library (see bruce:chimp). One of the
PUL libraries is PUL-GF, to support file I/O. The underlying I/O support is
not parallel (but see chapple:pario). The interface is parallel, however; in
particular, it supports file modes similar to those used in many systems,
which they call single, multi, random, and independent. Formatted and
unformatted, synchronous and asynchronous. Very general
multidimensional-array read and write functions. Ability to group multiple
I/O requests into atomic units, though not a full transaction capability. See
also chapple:pulgf-adv and chapple:pario.
Keywords: parallel I/O, pario-bib
Comment: See chapple:pulgf for a definition of
PUL-GF. This document describes the internal client-server interface to
PUL-GF, including ways that users can extend the functionality of PUL-GF. In
particular, they give an example of how a new file format (a run-length
encoded 2-d matrix) can be read and written transparently as if it were a
plain matrix file. The extensibility is offered by run-time registration of
user-defined interposition functions, to be called at key moments in the
processing of a file I/O request. See also bruce:chimp and chapple:pario.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: See also chapple:pulgf and chapple:pario.
An evolution of their parallel I/O interface. PUL-PF is a library on top of
existing file systems. Every process is either a client or a server; servers
write some portion of the file to a file in the file system. Servers can be
divided into groups so that files need not be spread across all servers.
There seems to be client caching, with consistency controlled differently
depending on access mode; when necessary, the application must call get-token
and send-token commands to serialize access to an atom. Independently of
their single, multi, random, and independent mode, they can read or write the
next, previous, current, or ``wild'' atom (wild means the next ``most
available'' atom not yet read by this process). Most I/O is on atoms, but
particles (pieces of atoms) can also be independently read and written. Hints
are supported to specify access pattern (random or sequential, stride), file
partitioning, mapping, atom size, or caching. In many of those cases it goes
beyond a hint to the supply of a user-defined function, e.g., for
cache-replacement algorithm.
Abstract: Previous implementations of out-of-core
columnsort limit the problem size to $N \leq \sqrt{(M/P)^3 / 2}$, where $N$
is the number of records to sort, $P$ is the number of processors, and $M$ is
the total number of records that the entire system can hold in its memory (so
that $M/P$ is the number of records that a single processor can hold in its
memory). We implemented two variations to out-of-core columnsort that relax
this restriction. Subblock columnsort is based on an algorithmic modification
of the underlying columnsort algorithm, and it improves the problem-size
bound to $N \leq (M/P)^{5/3} / 4^{2/3}$ but at the cost of additional disk
I/O. $M$-columnsort changes the notion of the column size in columnsort.
Keywords: parallel I/O, sorting, out-of-core
applications, pario-bib
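The two problem-size bounds quoted in the abstract are easy to compare
numerically; the small script below is illustrative only (the per-processor
memory sizes are arbitrary examples) and evaluates both bounds for a given
M/P.

    # Maximum number of records N sortable under each bound, given M/P records
    # of memory per processor.
    def n_original(m_per_p):
        # original out-of-core columnsort: N <= sqrt((M/P)^3 / 2)
        return ((m_per_p ** 3) / 2) ** 0.5

    def n_subblock(m_per_p):
        # subblock columnsort: N <= (M/P)^(5/3) / 4^(2/3)
        return (m_per_p ** (5.0 / 3.0)) / (4 ** (2.0 / 3.0))

    for m_per_p in (10**6, 10**7, 10**8):
        print(m_per_p, int(n_original(m_per_p)), int(n_subblock(m_per_p)))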
Abstract: Sorting very large datasets is a key
subroutine in almost any application that is built on top of a large
database. Two ways to sort out-of-core data dominate the literature:
merging-based algorithms and partitioning-based algorithms. Within these two
paradigms, all the programs that sort out-of-core data on a cluster rely on
assumptions about the input distribution. We propose a third way of
out-of-core sorting: oblivious algorithms. In all, we have developed six
programs that sort out-of-core data on a cluster. The first three programs,
based completely on Leighton's columnsort algorithm, have a restriction on
the maximum problem size that they can sort. The other three programs relax
this restriction; two are based on our original algorithmic extensions to
columnsort. We present experimental results to show that our algorithms
perform well. To the best of our knowledge, the programs presented in this
thesis are the first to sort out-of-core data on a cluster without making any
simplifying assumptions about the distribution of the data to be sorted.
Keywords: out-of-core sorting, columnsort, cluster
computing, parallel I/O, pario-bib
Comment: Doctoral dissertation. Advisor: Thomas H.
Cormen
Abstract: Leighton's columnsort algorithm sorts on
an $r \times s$ mesh, subject to the restrictions that $s$ is a divisor
of $r$ and that $r \geq 2s^2$ (so that the mesh is tall and thin). We show
how to mitigate both of these restrictions. One result is that the
requirement that $s$ is a divisor of $r$ is unnecessary; columnsort sorts
correctly whether or not $s$ divides $r$. We present two algorithms that, as
long as $s$ is a perfect square, relax the restriction that $r \geq 2s^2$;
both reduce the exponent of $s$ to $3/2$. One algorithm requires $r \geq
4s^{3/2}$ if $s$ divides $r$ and $r \geq 6s^{3/2}$ if $s$ does not
divide $r$. The other algorithm requires $r \geq 4s^{3/2}$, and it requires
$s$ to be a divisor of $r$. Both algorithms have applications in increasing
the maximum problem size in out-of-core sorting programs.
Keywords: parallel I/O, sorting, out-of-core
applications, pario-bib
Abstract: In today's workstation based
environment, applications such as design databases, multimedia databases, and
knowledge bases do not fit well into the relational data processing
framework. The object-oriented data model has been proposed to model and
process such complex databases. Due to the nature of the supported
applications, object-oriented database systems need efficient mechanisms for
the retrieval of complex objects and the navigation along the semantic links
among objects. Object clustering and buffering have been suggested as
efficient mechanisms for the retrieval of complex objects. However, to
improve the efficiency of the aforementioned operations, one has to look at
the recent advances in storage technology. This paper is an attempt to
investigate the feasibility of using parallel disks for object-oriented
databases. It analyzes the conceptual changes needed to map the clustering
and buffering schemes proposed on the new underlying architecture. The
simulation and performance evaluation of the proposed leveled-clustering and
mapping schemes utilizing parallel I/O disks are presented and analyzed.
Keywords: parallel I/O, disk array, object
oriented database, pario-bib
Abstract: The complexity of parallel I/O systems
imposes significant challenge in managing and utilizing the available system
resources to meet application performance, portability and usability goals.
We believe that a parallel I/O system that automatically selects efficient
I/O plans for user applications is a solution to this problem. In this paper,
we present such an automatic performance optimization approach for scientific
applications performing collective I/O requests on multidimensional arrays.
The approach is based on a high level description of the target workload and
execution environment characteristics, and applies genetic algorithms to
select high quality I/O plans. We have validated this approach in the Panda
parallel I/O library. Our performance evaluations on the IBM SP show that
this approach can select high quality I/O plans under a variety of system
conditions with a low overhead, and the genetic algorithm-selected I/O plans
are in general better than the default plans used in Panda.
Keywords: parallel I/O, performance optimization,
genetic algorithm, pario-bib
Keywords: collective I/O, multiprocessor file
system, parallel I/O, pario-bib
Keywords: parallel I/O, RAID, disk array,
pario-bib
Comment: An experimental validation of the
performance predictions of patterson:raid, plus some extensions. Confirms
that RAID level 5 (rotated parity) is best for large read/writes, and RAID
level 1 (mirroring) is best for small reads/writes.
Keywords: parallel I/O, RAID, disk striping,
pario-bib
Comment: Choosing the optimal striping unit, i.e.,
size of contiguous data on each disk (bit, byte, block, etc.). A small
striping unit is good for low-concurrency workloads since it increases the
parallelism applied to each request, but a large striping unit can support
high-concurrency workloads where each independent request depends on fewer
disks. They do simulations to find throughput, and thus to pick the striping
unit. They find equations for the best compromise striping unit based on the
concurrency and the disk parameters, or on the disk parameters alone. Some
key assumptions may limit applicability, but this is not addressed.
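Their analysis leads to a rule in which the stripe unit grows with the disk's
positioning overhead and with workload concurrency, as the comment above
explains. The sketch below encodes a rule of that general shape only; the
multiplier S and the example disk parameters are placeholders, not the
constants derived in the paper.

    # Stripe-unit heuristic of the general shape described above: the unit grows
    # with the disk's positioning overhead (seek + rotation, converted into bytes
    # of transfer) and with workload concurrency.  S is a placeholder constant.
    def stripe_unit_bytes(avg_position_time_s, transfer_rate_bps,
                          concurrency, sector_bytes=512, S=0.25):
        return (S * avg_position_time_s * transfer_rate_bps * (concurrency - 1)
                + sector_bytes)

    # Example: 12 ms positioning, 2 MB/s transfer (early-1990s disk), 8 concurrent requests.
    print(int(stripe_unit_bytes(0.012, 2e6, 8)))   # roughly 42 KB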
Abstract: To provide high performance for
applications with a wide variety of i/o requirements and to support many
different parallel platforms, the design of a parallel i/o system must
provide for efficient utilization of available bandwidth both for disk
traffic and for message passing. In this paper we discuss the message-passing
scalability of the server-directed i/o architecture of Panda, a library for
synchronized i/o of multidimensional arrays on parallel platforms. We show
how to improve i/o performance in situations where message-passing is a
bottleneck, by combining the server-directed i/o strategy for highly
efficient use of available disk bandwidth with new mechanisms to minimize
internal communication and computation overhead in Panda. We present
experimental results that show that with these improvements, Panda will
provide high i/o performance for a wider range of applications, such as
applications running with slow interconnects, applications performing i/o
operations on large numbers of arrays, or applications that require drastic
data rearrangements as data are moved between memory and disk (e.g., array
transposition). We also argue that in the future, the improved approach to
message-passing will allow Panda to support applications that are not closely
synchronized or that run in heterogeneous environments.
Keywords: parallel I/O, parallel file system,
pario-bib
Comment: see seamons:panda. This paper goes
further with some communication improvements.
Abstract: Parallel I/O systems typically consist
of individual processors, communication networks, and a large number of
disks. Managing and utilizing these resources to meet performance,
portability and usability goals of applications has become a significant
challenge. We believe that a parallel I/O system that automatically selects
efficient I/O plans for user applications is a solution to this problem. In
this paper, we present such an automatic performance optimization approach
for scientific applications performing collective I/O requests on
multidimensional arrays. Under our approach, an optimization engine in a
parallel I/O system selects optimal I/O plans automatically without human
intervention based on a description of the application I/O requests and the
system configuration. To validate our hypothesis, we have built an optimizer
that uses rule-based and randomized search-based algorithms to select
optimal parameter settings in Panda, a parallel I/O library for
multidimensional arrays. Our performance results obtained from two IBM SPs
with significantly different configurations show that the Panda optimizer is
able to select high-quality I/O plans and deliver high performance under a
variety of system configurations.
Keywords: parallel I/O, Panda, portability,
pario-bib
Abstract: We present an analytical performance
model for Panda, a library for synchronized i/o of large multidimensional
arrays on parallel and sequential platforms, and show how the Panda
developers use this model to evaluate Panda's parallel i/o performance and
guide future Panda development. The model validation shows that system
developers can simplify performance analysis, identify potential performance
bottlenecks, and study the design trade-offs for Panda on massively parallel
platforms more easily than by conducting empirical experiments. More
importantly, we show that the outputs of the performance model can be used to
help make optimal plans for handling application i/o requests, the first step
toward our long-term goal of automatically optimizing i/o request handling in
Panda.
Keywords: performance modeling, parallel I/O,
pario-bib
Comment: On Web and CDROM only. They derive a
detailed but fairly simple model of the Panda 2.0.5 parallel I/O library, by
carefully enumerating the costs involved in a collective I/O operation. They
measure Panda, AIX, and MPI to obtain parameters, and then they validate the
model by comparison with the actual Panda implementation running a basic
benchmark and an actual application. The model predicts the benchmark
performance very well, and is as much as 20% off on the performance of the
application. They have embedded the performance model in a "simulator", which
predicts the performance of a given sequence of collective I/O requests, and
they plan to use it in future versions of Panda to formulate I/O plans by
predicting the performance resulting from several different Panda parameter
settings, and choosing the best.
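The model itself is not reproduced here, but its enumerate-the-costs structure
can be sketched as a toy: a startup term plus per-phase communication and
disk-write terms, summed over the phases needed to move the whole array.
Everything below (the cost terms and parameter values) is illustrative rather
than Panda's actual model.

    # Toy cost model for one collective write of an array.
    def collective_write_time(array_bytes, io_nodes, buffer_bytes,
                              t_startup, net_bw, disk_bw):
        phase_bytes = io_nodes * buffer_bytes        # data moved per phase
        phases = -(-array_bytes // phase_bytes)      # ceiling division
        per_phase = buffer_bytes / net_bw + buffer_bytes / disk_bw
        return t_startup + phases * per_phase

    # Example: 1 GB array, 8 I/O nodes, 4 MB buffers, 100 MB/s links, 50 MB/s disks.
    print(collective_write_time(2**30, 8, 4 * 2**20, 0.05, 100e6, 50e6))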
Keywords: parallel I/O, RAID, disk array,
pario-bib
Comment: Basically an updated version of
patterson:raid and the prepublished version of gibson:failcorrect.
Keywords: parallel I/O, RAID, disk array,
pario-bib
Keywords: RAID, disk array, parallel I/O, survey,
pario-bib
Comment: An excellent overview of RAID concepts
and technology. It starts from the beginning with a discussion of disk
hardware, RAID basics, etc, and then goes on to discuss some of the more
advanced features. They also describe a few RAID implementations. Basically,
it is a perfect paper to read for folks new to RAID.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: A special back-end box for a Sun4 file
server, that hooks a HIPPI network through a crossbar to fast memory, a
parity engine, and a bunch of disks on SCSI. They pulled about 20 MB/s
through it, basically disk-limited; with more disks they would hit
32-40 MB/s. Much improved over RAID-I, which was limited by the memory
bandwidth of the Sun4 server.
Keywords: disk array, striping, RAID, pario-bib
Abstract: Parallel I/O systems typically consist
of individual processors, communication networks, and a large number of
disks. Managing and utilizing these resources to meet performance,
portability, and usability goals of high performance scientific applications
has become a significant challenge. For scientists, the problem is
exacerbated by the need to retune the I/O portion of their code for each
supercomputer platform where they obtain access. We believe that a parallel
I/O system that automatically selects efficient I/O plans for user
applications is a solution to this problem. The authors present such an
approach for scientific applications performing collective I/O requests on
multidimensional arrays. Under our approach, an optimization engine in a
parallel I/O system selects high quality I/O plans without human
intervention, based on a description of the application I/O requests and the
system configuration. To validate our hypothesis, we have built an optimizer
that uses rule based and randomized search based algorithms to tune parameter
settings in Panda, a parallel I/O library for multidimensional arrays. Our
performance results obtained from an IBM SP using an out-of-core matrix
multiplication application show that the Panda optimizer is able to select
high quality I/O plans and deliver high performance under a variety of system
configurations with a small total optimization overhead.
Keywords: parallel I/O, pario-bib
Keywords: parallel I/O, disk array, performance
evaluation, RAID, pario-bib
Comment: Measuring the performance of a RAID
prototype with a Sun4/280, 28 disks on 7 SCSI strings, using 4 HBA
controllers on a VME bus from the Sun. They found that lots of bottlenecks really
slowed them down. Under Sprite, the disks were the bottleneck for single disk
I/O, single disk B/W, and string I/O. Sprite was a bottleneck for single disk
I/O and string I/O. The host memory was a bottleneck for string B/W, HBA B/W,
overall I/O, and overall B/W. With a simpler OS, that saved on data copying,
they did better, but were still limited by the HBA, SCSI protocol, or the VME
bus. Clearly they needed more parallelism in the busses and control system.
Keywords: RAID, disk array, network file system,
parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of drapeau:raid-ii.
Keywords: parallel I/O, multimedia, tertiary
storage, memory hierarchy, pario-bib
Comment: Part of a special issue.
Abstract: We present a collection of new
techniques for designing and analyzing efficient external-memory algorithms
for graph problems and illustrate how these techniques can be applied to a
wide variety of specific problems. Our results include: \begin{itemize} \item
Proximate-neighboring. We present a simple method for deriving
external-memory lower bounds via reductions from a problem we call the
``proximate neighbors'' problem. We use this technique to derive non-trivial
lower bounds for such problems as list ranking, expression tree evaluation,
and connected components. \item PRAM simulation. We give methods for
efficiently simulating PRAM computations in external memory, even for some
cases in which the PRAM algorithm is not work-optimal. We apply this to
derive a number of optimal (and simple) external-memory graph algorithms.
\item Time-forward processing. We present a general technique for
evaluating circuits (or ``circuit-like'' computations) in external memory. We
also use this in a deterministic list ranking algorithm. \item
Deterministic 3-coloring of a cycle. We give several optimal methods for
3-coloring a cycle, which can be used as a subroutine for finding large
independent sets for list ranking. Our ideas go beyond a straightforward PRAM
simulation, and may be of independent interest. \item External
depth-first search. We discuss a method for performing depth first search
and solving related problems efficiently in external memory. Our technique
can be used in conjunction with ideas due to Ullman and Yannakakis in order
to solve graph problems involving closed semi-ring computations even when
their assumption that vertices fit in main memory does not hold.
\end{itemize} Our techniques apply to a number of problems, including
list ranking, which we discuss in detail, finding Euler tours,
expression-tree evaluation, centroid decomposition of a tree, least-common
ancestors, minimum spanning tree verification, connected and biconnected
components, minimum spanning forest, ear decomposition, topological sorting,
reachability, graph drawing, and visibility representation.
Keywords: parallel I/O algorithm, graph algorithm,
pario-bib
Abstract: Parallel scientific applications store
and retrieve very large, structured datasets. Directly supporting these
structured accesses is an important step in providing high-performance I/O
solutions for these applications. High-level interfaces such as HDF5 and
Parallel netCDF provide convenient APIs for accessing structured datasets,
and the MPI-IO interface also supports efficient access to structured data.
However, parallel file systems do not traditionally support such access. In
this work, we present an implementation of structured data access support in
the context of the Parallel Virtual File System (PVFS). We call this support
"datatype I/O" because of its similarity to MPI datatypes. This support is
built by using a reusable datatype-processing component from the MPICH2 MPI
implementation. We describe how this component is leveraged to efficiently
process structured data representations resulting from MPI-IO operations. We
quantitatively assess the solution using three test applications. We also
point to further optimizations in the processing path that could be leveraged
for even more efficient operation.
Keywords: I/O interface, high-level libraries,
PVFS, structured data representations, pario-bib
Comment: not read, don't have
Abstract: I/O performance remains a weakness of
parallel computing systems today. While this weakness is partly attributed to
rapid advances in other system components, I/O interfaces available to
programmers and the I/O methods supported by file systems have traditionally
not matched efficiently with the types of I/O operations that scientific
applications perform, particularly noncontiguous accesses. The MPI-IO
interface allows for rich descriptions of the I/O patterns desired for
scientific applications and implementations such as ROMIO have taken
advantage of this ability while remaining limited by underlying file system
methods. A method of noncontiguous data access, list I/O, was recently
implemented in the Parallel Virtual File System (PVFS). We implement support
for this interface in the ROMIO MPI-IO implementation. Through a suite of
noncontiguous I/O tests we compared ROMIO list I/O to current methods of
ROMIO noncontiguous access and found that the list I/O interface provides
performance benefits in many noncontiguous cases.
Keywords: parallel I/O, MPI-IO, ROMIO, list I/O,
noncontiguous access, pario-bib
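The rich, noncontiguous requests that MPI-IO can describe, and that datatype
I/O and list I/O aim to serve efficiently at the file system level, are
typically expressed with a derived datatype installed as the file view. The
mpi4py sketch below is a hedged illustration of a strided collective read;
the block sizes and file name are assumptions, not taken from either paper.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each process reads every size-th block of 64 doubles, starting at its
    # rank: a strided, noncontiguous pattern described once as a datatype.
    blocklen, nblocks = 64, 16
    filetype = MPI.DOUBLE.Create_vector(nblocks, blocklen, blocklen * size)
    filetype.Commit()

    fh = MPI.File.Open(comm, 'strided.dat', MPI.MODE_RDONLY)
    disp = rank * blocklen * MPI.DOUBLE.Get_size()
    fh.Set_view(disp, MPI.DOUBLE, filetype, 'native', MPI.INFO_NULL)

    buf = np.empty(nblocks * blocklen, dtype='f8')
    fh.Read_all(buf)          # collective read of the noncontiguous regions
    fh.Close()
    filetype.Free()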
Abstract: Processor-embedded disks, or smart
disks, with their network interface controller, can in effect be viewed as
processing elements with on-disk memory and secondary storage. The data sizes
and access patterns of today's large I/O-intensive workloads require
architectures whose processing power scales with increased storage capacity.
To address this concern, we propose and evaluate disk-based distributed smart
storage architectures. Based on analytically derived performance models, our
evaluation with representative workloads show that offloading processing and
performing point-to-point data communication improve performance over
centralized architectures. Our results also demonstrate that distributed
smart disk systems exhibit desirable scalability and can efficiently handle
I/O-intensive workloads, such as commercial decision support database (TPC-H)
queries, association rules mining, data clustering, and two-dimensional fast
Fourier transform, among others.
Keywords: processor-embedded disks, smart disks,
analytic performance models, I/O workload, pario-bib
Keywords: parallel I/O, tape striping, pario-bib
Comment: URL points to tech report version. He
points out two problems with tape striping: that it is difficult to keep tape
drives synchronized due to physical variations and to bad-segment remapping
in the tape, and that the start-up cost is very high making it difficult to
get multiple tapes loaded and started at the same time. So he proposes a
'triangular interleaving' rather than the traditional round-robin
interleaving, coupled with lots of buffering, to deal with these problems. He
also proposes to use different striping factors for different files (movies),
depending on access characteristics. He includes parameters for some tape
robots.
Abstract: The paper presents an analytical model
of a whole disk array architecture, XDAC, which consists of several major
subsystems and features: the two-dimensional array structure; IO-bus with
split transaction protocol; and cache for processing multiple I/O requests in
parallel. Our modelling approach is based on a subsystem access time per
request (SATPR) concept, in which we model for each subsystem the mean access
time per disk array request. The model is fed with a given set of
representative workload parameters and then used to conduct performance
analysis for exploring the impact of fork/join synchronization as well as
evaluating some architectural design issues of the XDAC system. Moreover, by
comparing the SATPRs of subsystems, we can identify the bottleneck for
performance improvements.
Keywords: disk array, performance evaluation,
analytical model, parallel I/O, pario-bib
Abstract: Fine grained data distributions are
widely used to balance computational loads across compute processes in
parallel scientific applications. When a fine grained data distribution is
used in memory, performance of I/O intensive applications can be limited not
only by disk speed but also by message passing, because a large number of
small messages may be generated by the implementation strategy used in the
underlying parallel file system or parallel I/O library. Combining (or
packetizing) a set of small messages into a large message is generally known
to speed up parallel I/O. However, overall I/O performance is affected not
only by small messages but also by other factors like cyclic block size and
interconnect characteristics. We describe small message combination and
communication scheduling for fine grained data distributions in the Panda
parallel I/O library and analyze I/O performance on parallel platforms having
different interconnects: IBM SP2, IBM workstation cluster connected by FDDI
and Pentium II cluster connected by Myrinet.
Keywords: parallel I/O, pario-bib
Abstract: A cost-effective way to run a parallel
application is to use existing workstations connected by a local area network
such as Ethernet or FDDI. In this paper, we present an approach for parallel
I/O of multidimensional arrays on small networks of workstations with a
shared-media interconnect, using the Panda I/O library. In such an
environment, the message passing throughput per node is lower than the
throughput obtainable from a fast disk and it is not easy for users to
determine the configuration which will yield the best I/O performance.
We introduce an I/O strategy that exploits local data to reduce the amount of
data that must be shipped across the network, present experimental results,
and analyze the results using an analytical performance model and predict the
best choice of I/O parameters. Our experiments show that the new
strategy results in a factor of 1.2-2.1 speedup in response time compared to
the Panda version originally developed for the IBM SP2, depending on the
array sizes, distributions and compute and I/O node meshes. Further, the
performance model predicts the results within a 13% margin of error.
Keywords: parallel I/O, distributed system,
pario-bib
Comment: They examine a system that supports nodes
that are both compute and I/O nodes. The assumption is that the application
is writing data to a new file, and does not care to which disks the data
goes. They are trying to decide which nodes should be used for I/O, given the
distribution of data on compute nodes and the distribution desired across
disks. They use a Hungarian algorithm to solve a weighted optimization
problem on a bipartite graph connecting I/O nodes to compute nodes, in an
attempt to minimize the data flow across the network. But there is no attempt
to make a decision that might be sensible for a future read operation that
may want to read in a different pattern.
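The weighted bipartite assignment described in the comment above can be
reproduced with any Hungarian-style solver. The sketch below uses SciPy's
linear_sum_assignment on a hypothetical cost matrix of bytes that each
compute node would have to ship to each candidate I/O node; the numbers are
invented for illustration and are not from the paper.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # cost[c, i]: bytes that compute node c would have to ship over the network
    # if I/O node i handled its portion of the write (hypothetical numbers).
    cost = np.array([[ 5, 40, 40, 40],
                     [40,  5, 40, 40],
                     [40, 40, 10, 40],
                     [30, 30, 30,  0]])

    rows, cols = linear_sum_assignment(cost)   # Hungarian-style optimal matching
    for c, i in zip(rows, cols):
        print("compute node %d -> I/O node %d, shipped bytes %d" % (c, i, cost[c, i]))
    print("total shipped:", cost[rows, cols].sum())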
Abstract: With the increasing number of scientific
applications manipulating huge amounts of data, effective high-level data
management is an increasingly important problem. Unfortunately, so far the
solutions to the high-level data management problem either require deep
understanding of specific storage architectures and file layouts (as in
high-performance file storage systems) or produce unsatisfactory I/O
performance in exchange for ease-of-use and portability (as in relational
DBMSs). In this paper we present a novel application development environment
which is built around an active meta-data management system (MDMS) to handle
high-level data in an effective manner. The key components of our
three-tiered architecture are user application, the MDMS, and a hierarchical
storage system (HSS). Our environment overcomes the performance problems of
pure database-oriented solutions, while maintaining their advantages in terms
of ease-of-use and portability. The high levels of performance are achieved
by the MDMS, with the aid of user-specified, performance-oriented directives.
Our environment supports a simple, easy-to-use yet powerful user interface,
leaving the task of choosing appropriate I/O techniques for the application
at hand to the MDMS. We discuss the importance of an active MDMS and show how
the three components of our environment, namely the application, the MDMS,
and the HSS, fit together. We also report performance numbers from our
ongoing implementation and illustrate that significant improvements are made
possible without undue programming effort.
Keywords: cluster computing, scientific computing,
parallel I/O, data management, pario-bib
Abstract: With the increasing number of scientific
applications manipulating huge amounts of data, effective data management is
an increasingly important problem. Unfortunately, so far the solutions to
this data management problem either require deep understanding of specific
storage architectures and file layouts (as in high-performance file systems)
or produce unsatisfactory I/O performance in exchange for ease-of-use and
portability (as in relational DBMSs). In this paper we present a new
environment which is built around an active meta-data management system
(MDMS). The key components of our three-tiered architecture are the user
application, the MDMS, and a hierarchical storage system (HSS). Our
environment overcomes the performance problems of pure database-oriented
solutions, while maintaining their advantages in terms of ease-of-use and
portability. The high levels of performance are achieved by the MDMS,
with the aid of user-specified directives. Our environment supports a simple,
easy-to-use yet powerful user interface, leaving the task of choosing
appropriate I/O techniques to the MDMS. We discuss the importance of an
active MDMS and show how the three components, namely the application, the
MDMS, and the HSS, fit together. We also report performance numbers from our
initial implementation and illustrate that significant improvements are made
possible without undue programming effort.
Keywords: cluster computing, scientific computing,
parallel I/O, data management, pario-bib
Comment: They argue that existing parallel file
systems are too low-level, they have their own set of I/O calls
(non-portable), and policies are generally hard-coded into the system.
Databases provide a portable layer on top of the file system, but they cannot
provide high performance. They propose to "combine the advantages of file
systems and databases, while avoiding their respective disadvantages." Their
system is composed of a user program, a meta-data management system (MDMS),
and a hierarchical storage system (HSS). The user program will query the MDMS
to learn where in the HSS their data reside, what the performance of the
storage system is, information about how to access data from the storage
system, etc.
Keywords: parallel I/O, out-of-core, pario-bib
Comment: This TR overviews the PASSION project,
and all its components: two-phase access, out-of-core support for structured
and unstructured problems, data sieving, prefetching, caching, compiler and
language support, file system support, virtual parallel file system, and
parallel pipes. They reference many of their related papers in an extensive
bibliography. See also singh:adopt, jadav:ioschedule, thakur:passion,
thakur:runtime, bordawekar:efficient, thakur:out-of-core,
delrosario:prospects, delrosario:two-phase, bordawekar:primitives,
bordawekar:delta-fs.
Abstract: We are developing a runtime library
which provides a number of routines to perform the I/O required in parallel
applications in an efficient and convenient manner. This is part of a project
called PASSION, which aims to provide software support for high-performance
parallel I/O at the compiler, runtime and file system levels. The PASSION
Runtime Library uses a high-level interface which makes it easy for the user
to specify the I/O required in the program. The user only needs to specify
what portion of the data structure needs to be read from or written to the file,
and the PASSION routines will perform all the necessary I/O efficiently. This
paper gives an overview of the PASSION Runtime Library and describes in
detail its high-level interface.
Keywords: parallel I/O, runtime library, pario-bib
Comment: See also choudhary:passion.
Keywords: file system, database, parallel I/O,
pario-bib, dfk
Comment: A position paper for the Strategic
Directions in Computer Research workshop at MIT in June 1996. See gibson:sdcr
and wegner:sdcr.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Language extensions to support parallel
I/O. Compiler optimizations. Runtime library to support the compiler and
interface with the native file system. Compiler would develop a mapping of
data to the processor memories and to the disks, and then decide on I/O
schedules to move data around, overlap I/O with computation, even move
computation around to best fit what is available in memory at a given time.
It can also help with checkpointing. Compiler should pass info to the runtime
system, which in turn may need to pass info to the file system, to help with
optimization. I/O scheduling includes reordering accesses; they even go so
far as to propose doing seek optimization in the runtime library. Support for
collective I/O. Extension of MPI to I/O, to take advantage of its support for
asynchrony, scatter-gather, etc. On the way, they hope to work with the FS
people to decide on the functional requirements of the file system. See also
poole:sio-survey, bagrodia:sio-character, bershad:sio-os.
Abstract: We explore the method of combining the
replication and parity approaches to tolerate multiple disk failures in a
disk array. In addition to the conventional mirrored and chained declustering
methods, a method based on the hybrid of mirrored-and-chained declustering is
explored. A performance study that explores the effect of combining
replication and parity approaches is conducted. It is experimentally shown
that the proposed approach can lead to the most cost-effective solution if
the objective is to sustain the same load as before the failures.
Keywords: fault tolerance, disk array,
replication, declustering, parallel I/O, pario-bib
Comment: Consider hybrid chained and mirrored
declustering.
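For readers unfamiliar with the two base placements being hybridized here, a minimal sketch of where a fragment's primary and backup copies land (my illustration, not the paper's algorithm; fragment numbering and disk count are arbitrary):

    def mirrored_declustering(fragment, ndisks):
        # Disks form fixed pairs; the backup always lives on the partner disk.
        primary = fragment % ndisks
        backup = primary + 1 if primary % 2 == 0 else primary - 1
        return primary, backup

    def chained_declustering(fragment, ndisks):
        # The backup lives on the next disk around the array (a chain).
        primary = fragment % ndisks
        backup = (primary + 1) % ndisks
        return primary, backup

    for f in range(4):
        print(f, mirrored_declustering(f, 4), chained_declustering(f, 4))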
Abstract: We discuss data production rates and
their impact on the performance of scientific applications using parallel
computers. On one hand, too high rates of data production can be
overwhelming, exceeding logistical capacities for transfer, storage and
analysis. On the other hand, the rate limiting step in a
computationally-based study should be the human-guided analysis, not the
calculation. We present performance data for a biomolecular simulation of the
enzyme, acetylcholinesterase, which uses the parallel molecular dynamics
program EulerGROMOS. The actual production rates are compared against a
typical time frame for results analysis where we show that the rate limiting
step is the simulation, and that to overcome this will require improved
output rates.
Keywords: parallel I/O application, molecular
dynamics, pario-bib
Comment: Note proceedings only on CD-ROM or WWW.
Keywords: parallel I/O algorithm, sorting,
pario-bib
Comment: They present a new parallel sorting
algorithm that allows overlap between disk, network, and processor. By
pipelining the tasks, they can double the speed of sorting; best results, of
course, when these three components take approximately equal time. The disk
I/O is really only used to load the initial data set and write the output
data set, rather than being used for an external sorting scheme. They obtain
their gains by overlapping that disk I/O with the communication and
processing.
Keywords: parallel I/O, HPF, compiler, pario-bib
Keywords: file system, parallel I/O, disk layout,
pario-bib
Comment: These two guys from NCAR redid the block
allocation strategy routine on the Cray. Current strategy uses round-robin
among the disks, using a different disk for each allocation request. Each
request looks for blocks on that disk, until it is satisfied or space runs
out, and then goes to the next disk. It uses a free-block bitmap to find the
blocks. Problem: too many extents, not enough contiguity. These guys tried
first-fit and best-fit from all extents on all disks. First-fit had faster
allocation time, of course, and both had much lower file fragmentation. They
also used the vector hardware to search the bitmap for non-zero words.
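A minimal sketch of the first-fit idea over a free-block bitmap (mine, for illustration only; the real allocator works over per-disk bitmaps and uses vectorized searches, as noted above):

    def first_fit(bitmap, nblocks):
        # Find the first run of nblocks free blocks (1 = free, 0 = allocated).
        run_start, run_len = None, 0
        for i, free in enumerate(bitmap):
            if free:
                if run_len == 0:
                    run_start = i
                run_len += 1
                if run_len == nblocks:
                    for j in range(run_start, run_start + nblocks):
                        bitmap[j] = 0          # mark the extent allocated
                    return run_start
            else:
                run_len = 0
        return None                            # no extent large enough

    bitmap = [1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
    print(first_fit(bitmap, 3))                # -> 2 (blocks 2..4)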
Keywords: parallel architecture, parallel I/O,
pario-bib
Comment: They argue for network-attached devices,
and for making I/O devices and networks, instead of CPUs, more the center of
architectural design.
Abstract: In order for I/O systems to achieve high
performance in a parallel environment, they must either sacrifice client-side
file caching, or keep caching and deal with complex coherency issues. The
most common technique for dealing with cache coherency in multi-client file
caching environments uses file locks to bypass the client-side cache. Aside
from effectively disabling cache usage, file locking is sometimes unavailable
on larger systems. The high-level abstraction layer of MPI allows us to
tackle cache coherency with additional information and coordination without
using file locks. By approaching the cache coherency issue further up, the
underlying I/O accesses can be modified in such a way as to ensure access to
coherent data while satisfying the user's I/O request. We can effectively
exploit the benefits of a file system's client-side cache while minimizing
its management costs.
Keywords: client-side file caching, file locking,
MPI, pario-bib
Abstract: This paper describes the functionality
of ViC*, a compiler for a variant of the data-parallel language C* with
support for out-of-core data. The compiler translates C* programs with shapes
declared outofcore, which describe parallel data stored on disk. The compiler
output is a SPMD-style program in standard C with I/O and library calls added
to efficiently access out-of-core parallel data. The ViC* compiler also
applies several program transformations to improve out-of-core data access.
Keywords: compiler, data-parallel programming,
programming language, virtual memory, out of core, parallel I/O, pario-bib
Abstract: This paper describes the functionality
of ViC*, a compiler for a variant of the data-parallel language C* with
support for out-of-core data. The compiler translates C* programs with shapes
declared outofcore, which describe parallel data stored on disk. The compiler
output is a SPMD-style program in standard C with I/O and library calls added
to efficiently access out-of-core parallel data. The ViC* compiler also
applies several program transformations to improve out-of-core data layout
and access.
Keywords: compiler, data-parallel programming,
programming language, virtual memory, out of core, parallel I/O, pario-bib
Keywords: parallel computer architecture, shared
memory, parallel I/O, pario-bib
Comment: The Convex Exemplar connects
hypernodes, which are basically SMP nodes built from 8 HP PA-RISC CPUs, lots
of RAM, and a crossbar switch, with their own implementation of the SCI
interconnect. Hierarchical caching supports a global shared physical address
space. Each hypernode can also have an I/O adapter, to which they can attach
lots of different I/O devices. The I/O adapter has the capability to DMA
directly into any memory in the system, even on other hypernodes. Each
hypernode runs its own file-system server, which manages UNIX file systems on
the devices of that hypernode. Striped file systems are supported in
software, although it's not clear if they can stripe across hypernodes, or
only within hypernodes, i.e., whether (striped) file systems can span multiple
hypernodes.
Keywords: parallel I/O, parallel file system,
striping, pario-bib
Comment: Implementation of striped disks on the
CONVEX. Uses partitions of normal device drivers. Kernel data structure knows
about the interleaving granularity, the set of partitions, sizes, etc.
Keywords: multimedia, compression, data-parallel
computing, Maspar, parallel I/O, pario-bib
Keywords: parallel I/O, database, disk caching,
pario-bib
Comment: A database machine.
Experimental/analytical model of a placement algorithm that declusters
relations across several parallel, independent disks. The declustering is
done on a subset of the disks, and the choices involved are the number of
disks to decluster onto, which relations to put where, and whether a relation
should be cache-resident. Communications overhead limits the usefulness of
declustering in some cases, depending on the workload. See boral:bubba.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: Part of jin:io-book, revised version of
corbett:mpi-overview.
Keywords: multiprocessor file system, Vesta,
parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of corbett:jvesta.
Abstract: The Vesta parallel file system is
designed to provide parallel file access to application programs running on
multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of
files: a file is not a sequence of bytes, but rather it can be partitioned
into multiple disjoint sequences that are accessed in parallel. The
partitioning-which can also be changed dynamically-reduces the need for
synchronization and coordination during the access. Some control over the
layout of data is also provided, so the layout can be matched with the
anticipated access patterns. The system is fully implemented and forms the
basis for the AIX Parallel I/O File System on the IBM SP2. The implementation
does not compromise scalability or parallelism. In fact, all data accesses
are done directly to the I/O node that contains the requested data, without
any indirection or access to shared metadata. Disk mapping and caching
functions are confined to each I/O node, so there is no need to keep data
coherent across nodes. Performance measurements show good scalability with
increased resources. Moreover, different access patterns are shown to achieve
similar performance.
Keywords: multiprocessor file system, Vesta,
parallel I/O, pario-bib
Comment: See also corbett:pfs, corbett:vesta*,
feitelson:pario. This is the ultimate Vesta reference. There seem to be only
a few small things that are completely new over what's been published
elsewhere, although this presentation is much more complete and polished.
Keywords: parallel I/O, message-passing,
multiprocessor file system interface, pario-bib
Comment: Superseded by mpi-ioc:mpi-io5. See the
MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.
Keywords: parallel I/O, message-passing,
multiprocessor file system interface, pario-bib
Comment: The goal is to design a standard file
interface for SPMD message-passing programs. An earlier version of this
specification was prost:mpi-io. Superseded by mpi-ioc:mpi-io5. See also the
general MPI I/O web page at http://parallel.nas.nasa.gov/MPI-IO/.
Keywords: parallel I/O, message-passing,
multiprocessor file system interface, pario-bib
Comment: Superseded by mpi-ioc:mpi-io5. See the
MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.
Abstract: Thanks to MPI, writing portable message
passing parallel programs is almost a reality. One of the remaining problems
is file I/O. Although parallel file systems support similar interfaces, the
lack of a standard makes developing a truly portable program impossible. It
is not feasible to develop large scientific applications from scratch for
each generation of parallel machine, and, in the scientific world, a program
is not considered truly portable unless it not only compiles, but also runs
efficiently. The MPI-IO interface is being proposed as an extension to
the MPI standard to fill this need. MPI-IO supports a high-level interface to
describe the partitioning of file data among processes, a collective
interface describing complete transfers of global data structures between
process memories and files, asynchronous I/O operations, allowing computation
to be overlapped with I/O, and optimization of physical file layout on
storage devices (disks).
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: A more readable explanation of MPI-IO
than the proposed-standard document corbett:mpi-io3. See polished book
version, corbett:mpi-overview-book. See also the slides presented at IOPADS
Abstract: Thanks to MPI, writing portable message
passing parallel programs is almost a reality. One of the remaining problems
is file I/O. Although parallel file systems support similar interfaces, the
lack of a standard makes developing a truly portable program impossible. It
is not feasible to develop large scientific applications from scratch for
each generation of parallel machine, and, in the scientific world, a program
is not considered truly portable unless it not only compiles, but also runs
efficiently. The MPI-IO interface is being proposed as an extension to
the MPI standard to fill this need. MPI-IO supports a high-level interface to
describe the partitioning of file data among processes, a collective
interface describing complete transfers of global data structures between
process memories and files, asynchronous I/O operations, allowing computation
to be overlapped with I/O, and optimization of physical file layout on
storage devices (disks).
Keywords: parallel I/O, file system interface,
pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: Parallel computer architectures require
innovative software solutions to utilize their capabilities. This statement
is true for system software no less than for application programs. File
system development for the IBM SP product line of computers started with the
Vesta research project, which introduced the ideas of parallel access to
partitioned files. This technology was then integrated with a conventional
Advanced Interactive Executive (AIX) environment to create the IBM AIX
Parallel I/O File System product. We describe the design and implementation
of Vesta, including user interfaces and enhancements to the control
environment needed to run the system. Changes to the basic design that were
made as part of the AIX Parallel I/O File System are identified and
justified.
Keywords: parallel file system, parallel I/O,
Vesta, pario-bib
Comment: Probably the most authoritative
Vesta/PIOFS paper yet. Good description of the system, motivations, etc. Not
as much detail as some, like corbett:vesta-di.
Abstract: Row-Diagonal Parity (RDP) is a new
algorithm for protecting against double disk failures. It stores all data
unencoded, and uses only exclusive-or operations to compute parity. RDP is
provably optimal in computational complexity, both during construction and
reconstruction. Like other algorithms, it is optimal in the amount of
redundant information stored and accessed. RDP works within a single stripe
of blocks of sizes normally used by file systems, databases and disk arrays.
It can be utilized in a fixed (RAID-4) or rotated (RAID-5) parity placement
style. It is possible to extend the algorithm to encompass multiple RAID-4 or
RAID-5 disk arrays in a single RDP disk array. It is possible to add disks to
an existing RDP array without recalculating parity or moving data.
Implementation results show that RDP performance can be made nearly equal to
single parity RAID-4 and RAID-5 performance.
Keywords: fault tolerance, disk failures,
algorithms, row-diagonal parity, RAID, pario-bib
Comment: Awarded best paper.
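A small sketch of the row/diagonal parity construction as RDP is usually described (my illustration, not code from the paper; the choice of the prime p, the disk ordering, and the use of small integers standing in for blocks are assumptions, and reconstruction after failures is omitted):

    import random

    p = 5                           # a prime; each stripe has p-1 rows
    rows, data_disks = p - 1, p - 1

    def build_stripe(data):
        # data[r][d] is the block at row r on data disk d; parity is plain XOR.
        row_parity = [0] * rows
        diag_parity = [0] * rows
        for r in range(rows):
            for d in range(data_disks):
                row_parity[r] ^= data[r][d]
        # Diagonals run over the data disks *and* the row-parity disk;
        # diagonal (r + disk) mod p == p-1 is the "missing" diagonal
        # for which no parity is stored.
        for r in range(rows):
            for disk in range(p):                  # data disks plus row-parity disk
                block = data[r][disk] if disk < data_disks else row_parity[r]
                diag = (r + disk) % p
                if diag != p - 1:
                    diag_parity[diag] ^= block
        return row_parity, diag_parity

    data = [[random.randrange(256) for _ in range(data_disks)] for _ in range(rows)]
    print(build_stripe(data))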
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: Specs of the proposed SIO low-level
interface for parallel file systems. Key features: linear file model,
scatter-gather read and write calls (list of strided segments), asynch
versions of all calls, extensive hint system. Naming structure is
unspecified; no directories specified. Permissions left out. Some control
over client caching and over disk layout. Each file has a (small) 'label',
which is just a little space for application-controlled meta data. Optional
extensions: collective read and write calls, fast copy.
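As a purely hypothetical illustration of what a "list of strided segments" request looks like (this is not the SIO interface itself; the field names and the expansion helper are mine):

    from dataclasses import dataclass

    @dataclass
    class StridedSegment:
        start: int    # file offset of the first piece, in bytes
        size: int     # bytes per piece
        stride: int   # distance between starts of consecutive pieces
        count: int    # number of pieces

    def expand(segments):
        # Flatten a scatter-gather request into the contiguous (offset, length)
        # extents a file system would ultimately see.
        for seg in segments:
            for k in range(seg.count):
                yield (seg.start + k * seg.stride, seg.size)

    request = [StridedSegment(start=0, size=4096, stride=16384, count=4),
               StridedSegment(start=65536, size=8192, stride=8192, count=2)]
    print(list(expand(request)))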
Keywords: multiprocessor file system interface,
parallel I/O, Vesta, pario-bib
Keywords: parallel I/O, multiprocessor file
system, concurrent file checkpointing, multiprocessor file system interface,
Vesta, pario-bib
Comment: See corbett:jvesta. Design of a file
system for a message-passing MIMD multiprocessor to be used for scientific
computing. Separate I/O nodes from compute nodes; I/O nodes and disks are
viewed as a data-staging area. File system runs on I/O nodes only. Files
declustered by record, among physical partitions, each residing on a separate
disk, and each separately growable. Then the user maps logical partitions,
one per process, on the file at open time. These are designed to be
two-dimensional, so that mapping arrays of various strides and contiguities,
with records as the basic unit, is easy. Various consistency and atomicity
requirements. File checkpointing, really snapshotting, is built in. No client
caching, no redundancy for reliability. See also corbett:vesta2,
corbett:vesta3, feitelson:pario.
Abstract: The Vesta Parallel file system is
designed to provide parallel file access to application programs running on
multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of
files: a file is not a sequence of bytes, but rather it can be partitioned
into multiple disjoint sequences that are accessed in parallel. The
partitioning - which can also be changed dynamically - reduces the need for
synchronization and coordination during the access. Some control over the
layout of data is also provided, so the layout can be matched with the
anticipated access patterns. The system is fully implemented, and is
beginning to be used by application programmers. The implementation does not
compromise scalability or parallelism. In fact, all data accesses are done
directly to the I/O node that contains the requested data, without any
indirection or access to shared metadata. There are no centralized control
points in the system.
Keywords: parallel I/O, multiprocessor file
system, file system interface, Vesta, pario-bib
Comment: See corbett:jvesta and corbett:vesta* for
other background. Note that since this paper they have put Vesta on top of a
raw disk (using 64 KB blocks) rather than on top of AIX-JFS. They describe
here the structure of Vesta (2-d files, cells, subfiles, etc), the ordering
of bytes within a subfile, hashing of the file name to find the file
metadata, Xrefs instead of directories, caching, asynchronous I/O,
prefetching, shared file pointers, concurrency control, and block-list
structure. Many things, some visible to the user and some not, are new.
Keywords: multiprocessor file system, parallel
I/O, Vesta, pario-bib
Comment: Complete user's manual of the Vesta file
system. Impressive in its completeness (e.g., it has user quotas). Handy for
its detailed description of the interface, but doesn't say much (of course)
about the implementation.
Keywords: multiprocessor file system, file
checkpointing, Vesta, pario-bib
Comment: See also corbett:jvesta, corbett:vesta,
corbett:vesta3, feitelson:pario. A new abstraction and a new interface.
Typical systems use transparent striping, and access modes. They believe that
``optimization requires control''. Need to be able to tell the system what
you want. User-defined or default. Asynch I/O. Concurrency control.
Checkpointing. Export/import to external storage. New abstraction: file is
multiple sequences of records. Each process sees a logical partition of the
file. Physical partition is one or more disks. Logical partition defined in
terms of records. Can repartition without moving data. Rectilinear
decompositions of file data to processors. They can do gather/scatter
requests. Using logical partitions gives the system the knowledge that the
user's accesses are disjoint. Collective operations with consistency checks, vs.
independent access. Collective open defines logical view, then synch, then
check that partitions are disjoint. If not, then they have access modes to
define semantics (more or less the same as other systems). Consider this a
target for HPF, etc. Physical partitioning (record size and number of
partitions) is defined at create time. Can they have different physical or
logical partition sizes in the same file? Future: parallel pipelines,
``out-of-core'' backing store for HPF arrays, high-level operations,
collective operations.
Keywords: parallel I/O, algorithm, pario-bib
Comment: Earlier version available as Dartmouth
tech report PCS-TR93-193. But the most recent and complete version is
Dartmouth PCS-TR94-223, cormen:bmmc-tr.
Keywords: parallel I/O algorithms, pario-bib
Comment: Supersedes cormen:bmmc.
Abstract: Although several algorithms have been
developed for the Parallel Disk Model (PDM), few have been implemented.
Consequently, little has been known about the accuracy of the PDM in
measuring I/O time and total running time to perform an out-of-core
computation. This paper analyzes timing results on multiple-disk platforms
for two PDM algorithms, out-of-core radix sort and BMMC permutations, to
determine the strengths and weaknesses of the PDM. The results indicate
the following. First, good PDM algorithms are usually not I/O bound. Second,
of the four PDM parameters, one (problem size) is a good indicator of I/O
time and running time, one (memory size) is a good indicator of I/O time but
not necessarily running time, and the other two (block size and number of
disks) do not necessarily indicate either I/O or running time. Third, because
PDM algorithms tend not to be I/O bound, using asynchronous I/O can reduce
I/O wait times significantly. The software interface to the PDM is part
of the ViC* run-time library. The interface is a set of wrappers that are
designed to be both efficient and portable across several underlying file
systems and target machines.
Keywords: parallel I/O, parallel I/O algorithm,
compiler, pario-bib
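For reference, the four PDM parameters discussed above enter the model through the standard Vitter-Shriver sorting bound (this is the textbook statement, not something taken from this entry): with problem size N, memory size M, block size B, and D independent disks, sorting requires

    \Theta\!\left( \frac{N}{DB} \, \log_{M/B} \frac{N}{B} \right)

parallel I/O operations, which is how problem size, memory size, block size, and disk count all feed into predicted I/O time.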
Abstract: Although several algorithms have been
developed for the Parallel Disk Model (PDM), few have been implemented.
Consequently, little has been known about the accuracy of the PDM in
measuring I/O time and total time to perform an out-of-core computation. This
paper analyzes timing results on a uniprocessor with several disks for two
PDM algorithms, out-of-core radix sort and BMMC permutations, to determine
the strengths and weaknesses of the PDM. The results indicate the
following. First, good PDM algorithms are usually not I/O bound. Second, of
the four PDM parameters, two (problem size and memory size) are good
indicators of I/O time and running time, but the other two (block size and
number of disks) are not. Third, because PDM algorithms tend not to be I/O
bound, asynchronous I/O effectively hides I/O times. The software
interface to the PDM is part of the ViC* run-time library. The interface is a
set of wrappers that are designed to be both efficient and portable across
several parallel file systems and target machines.
Keywords: parallel I/O, parallel I/O algorithm,
compiler, pario-bib
Comment: This used to be called cormen:early-vic
but I renamed it because the paper will appear in parcomp.
Abstract: The fast Fourier transform (FFT) plays a
key role in many areas of computational science and engineering. Although
most one-dimensional FFT problems can be solved entirely in main memory, some
important classes of applications require out-of-core techniques. For these,
use of parallel I/O systems can improve performance considerably. This paper
shows how to perform one-dimensional FFTs using a parallel disk system with
independent disk accesses. We present both analytical and experimental
results for performing out-of-core FFTs in two ways: using traditional
virtual memory with demand paging, and using a provably asymptotically
optimal algorithm for the parallel disk model (PDM) of J.S. Vitter and E.A.M.
Shriver (1994). When run on a DEC 2100 server with a large memory and eight
parallel disks, the optimal algorithm for the PDM runs up to 144.7 times
faster than in-core methods under demand paging. Moreover, even including I/O
costs, the normalized times for the optimal PDM algorithm are competitive
with, or better than, those for in-core methods even when they run entirely in memory.
Keywords: parallel I/O, out of core, scientific
computing, FFT, pario-bib
Comment: see also cormen:fft2 and cormen:fft3.
Part of a special issue.
Abstract: The Fast Fourier Transform (FFT) plays a
key role in many areas of computational science and engineering. Although
most one-dimensional FFT problems can be solved entirely in main memory, some
important classes of applications require out-of-core techniques. For these,
use of parallel I/O systems can improve performance considerably. This paper
shows how to perform one-dimensional FFTs using a parallel disk system with
independent disk accesses. We present both analytical and experimental
results for performing out-of-core FFTs in two ways: using traditional
virtual memory with demand paging, and using a provably asymptotically
optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver.
When run on a DEC 2100 server with a large memory and eight parallel disks,
the optimal algorithm for the PDM runs up to 144.7 times faster than in-core
methods under demand paging. Moreover, even including I/O costs, the
normalized times for the optimal PDM algorithm are competitive with, or
better than, those for in-core methods even when they run entirely in memory.
Keywords: parallel I/O, out of core, scientific
computing, FFT, pario-bib
Abstract: This paper extends an earlier
out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the
Parallel Disk Model (PDM) to use multiple processors. Four out-of-core
multiprocessor methods are examined. Operationally, these methods differ in
the size of "mini-butterfly" computed in memory and how the data are
organized on the disks and in the distributed memory of the multiprocessor.
The methods also perform differing amounts of I/O and communication. Two of
them have the remarkable property that even though they are computing the FFT
on a multiprocessor, all interprocessor communication occurs outside the
mini-butterfly computations. Performance results on a small workstation
cluster indicate that except for unusual combinations of problem size and
memory size, the methods that do not perform interprocessor communication
during the mini-butterfly computations require approximately 86% of the time
of those that do. Moreover, the faster methods are much easier to implement.
Keywords: parallel I/O, out of core, scientific
computing, FFT, pario-bib
Comment: Extends the work of cormen:fft.
Abstract: This paper extends an earlier
out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the
Parallel Disk Model (PDM) to use multiple processors. Four out-of-core
multiprocessor methods are examined. Operationally, these methods differ in
the size of "mini-butterfly" computed in memory and how the data are
organized on the disks and in the distributed memory of the multiprocessor.
The methods also perform differing amounts of I/O and communication. Two of
them have the remarkable property that even though they are computing the FFT
on a multiprocessor, all interprocessor communication occurs outside the
mini-butterfly computations; communication that ordinarily occurs in a
butterfly is folded into other data-movement operations. An analysis program
shows that the two methods that use no butterfly communication usually use
less communication overall than the other methods. The analysis program is
fast enough that it can be invoked at run time to determine which of the four
methods uses the least communication. One set of performance results on a
small workstation cluster indicates that the methods without butterfly
communication are approximately 9.5% faster. Moreover, they are much easier
to implement.
Keywords: out of core, parallel I/O, pario-bib
Comment: They find a way to move the
interprocessor communication involved in the out-of-core FFT into a single
BMMC permutation between "super-levels", where each super-level involves
log(M) stages of the FFT. This usually leads to less communication and to
better overall performance. See also cormen:fft and cormen:fft2.
Abstract: FG is a programming environment for
asynchronous programs that run on clusters and fit into a pipeline framework.
It enables the programmer to write a series of synchronous functions and
represents them as stages of an asynchronous pipeline. FG mitigates the high
latency inherent in interprocessor communication and accessing the outer
levels of the memory hierarchy. It overlaps separate pipeline stages that
perform communication, computation, and I/O by running the stages
asynchronously. Each stage maps to a thread. Buffers, whose sizes correspond
to block sizes in the memory hierarchy, traverse the pipeline. FG makes such
pipeline-structured parallel programs easier to write, smaller, and faster.
FG offers several advantages over statically scheduled overlapping and
dynamically scheduled overlapping via explicit calls to thread functions.
First, it reduces coding and debugging time. Second, we find that it reduces
code size by approximately 15-26%. Third, according to experimental results,
it improves performance. Compared with programs that use static scheduling,
FG-generated programs run approximately 61-69% faster on a 16-node Beowulf
cluster. Compared with programs that make explicit calls for dynamically
scheduled threads, FG-generated programs run slightly faster. Fourth, FG
offers various design options and makes it easy for the programmer to explore
different pipeline configurations.
Keywords: asynchronous I/O, pipelined I/O,
pario-bib
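A bare-bones illustration of the overlap that FG automates (my sketch, not FG's API): each pipeline stage runs in its own thread and passes buffers through bounded queues, so I/O, communication, and computation can proceed concurrently:

    import threading, queue

    def stage(fn, inq, outq):
        # Pull a buffer, apply this stage's work, pass the result downstream.
        while True:
            buf = inq.get()
            if buf is None:                  # sentinel: shut the pipeline down
                if outq is not None:
                    outq.put(None)
                break
            result = fn(buf)
            if outq is not None:
                outq.put(result)

    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    # Stand-ins for an "I/O" stage and a "compute" stage.
    io_stage = threading.Thread(target=stage, args=(lambda b: b, q1, q2))
    compute_stage = threading.Thread(
        target=stage, args=(lambda b: print([x * 2 for x in b]), q2, None))
    io_stage.start(); compute_stage.start()
    for block in ([1, 2], [3, 4], None):     # two buffers, then the sentinel
        q1.put(block)
    io_stage.join(); compute_stage.join()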
Abstract: Several algorithms for parallel disk
systems have appeared in the literature recently, and they are asymptotically
optimal in terms of the number of disk accesses. Scalable systems with
parallel disks must be able to run these algorithms. We present for the first
time a list of capabilities that must be provided by the system to support
these optimal algorithms: control over declustering, querying about the
configuration, independent I/O, and turning off parity, file caching, and
prefetching. We summarize recent theoretical and empirical work that
justifies the need for these capabilities. In addition, we sketch an
organization for a parallel file interface with low-level primitives and
higher-level operations.
Keywords: parallel I/O, multiprocessor file
systems, algorithm, file system interface, dfk, pario-bib
Comment: Describing the file system capabilities
needed by parallel I/O algorithms to effectively use a parallel disk system.
Revised as Dartmouth PCS-TR93-188 (updated).
Abstract: Several algorithms for parallel disk
systems have appeared in the literature recently, and they are asymptotically
optimal in terms of the number of disk accesses. Scalable systems with
parallel disks must be able to run these algorithms. We present a list of
capabilities that must be provided by the system to support these optimal
algorithms: control over declustering, querying about the configuration,
independent I/O, turning off file caching and prefetching, and bypassing
parity. We summarize recent theoretical and empirical work that justifies the
need for these capabilities.
Keywords: parallel I/O, multiprocessor file
systems, algorithm, file system interface, dfk, pario-bib
Comment: Describing the file system capabilities
needed by parallel I/O algorithms to effectively use a parallel disk system.
Cite cormen:integrate.
Abstract: This paper presents asymptotically equal
lower and upper bounds for the number of parallel I/O operations required to
perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel
Disk Model proposed by J.S. Vitter and E.A.M. Shriver (1994). A BMMC
permutation maps a source index to a target index by an affine transformation
over GF(2), where the source and target indices are treated as bit vectors.
The class of BMMC permutations includes many common permutations, such as
matrix transposition (when dimensions are powers of 2), bit-reversal
permutations, vector-reversal permutations, hypercube permutations, matrix
reblocking, Gray-code permutations, and inverse Gray-code permutations. The
upper bound improves upon the asymptotic bound in the previous best known
BMMC algorithm and upon the constant factor in the previous best known
bit-permute/complement (BPC) permutation algorithm. The algorithm achieving
the upper bound uses basic linear-algebra techniques to factor the
characteristic matrix for the BMMC permutation into a product of factors,
each of which characterizes a permutation that can be performed in one pass
over the data. The factoring uses new subclasses of BMMC permutations:
memoryload-dispersal (MLD) permutations and their inverses. These subclasses
extend the catalog of one-pass permutations. Although many BMMC permutations
of practical interest fall into subclasses that might be explicitly invoked
within the source code, this paper shows how to quickly detect whether a
given vector of target addresses specifies a BMMC permutation. Thus, one can
determine efficiently at run time whether a permutation to be performed is
BMMC and then avoid the general-permutation algorithm and save parallel I/Os
by using the BMMC permutation algorithm herein.
Keywords: parallel I/O, parallel I/O algorithms,
pario-bib
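A tiny illustration of what "mapping a source index to a target index by an affine transformation over GF(2)" means in code (my sketch, not the paper's algorithm; the example matrix is the 3-bit bit-reversal permutation, a very simple BMMC permutation):

    import numpy as np

    def bmmc_apply(A, c, x, nbits):
        # Treat the index x as a bit vector (least-significant bit first),
        # apply y = A x + c with all arithmetic over GF(2), and repack.
        bits = np.array([(x >> i) & 1 for i in range(nbits)])
        y = (A.dot(bits) + c) % 2
        return int(sum(int(b) << i for i, b in enumerate(y)))

    nbits = 3
    A = np.array([[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0]])      # bit reversal; A must be nonsingular over GF(2)
    c = np.zeros(3, dtype=int)
    print([bmmc_apply(A, c, x, nbits) for x in range(8)])
    # -> [0, 4, 2, 6, 1, 5, 3, 7]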
Keywords: scientific computing, out-of-core
computation, parallel I/O, pario-bib
Comment: Part of a special issue on parallel and
distributed I/O.
Keywords: parallel I/O algorithm, pario-bib
Comment: See also cormen:thesis.
Keywords: parallel I/O, algorithm, pario-bib
Comment: Lots of algorithms for out-of-core
permutation problems. See also cormen:permute, cormen:integrate.
Abstract: This paper describes the functionality
of ViC*, a compiler-like preprocessor for out-of-core C*. The input to ViC*
is a C* program but with certain shapes declared outofcore, which
means that all parallel variables of these shapes reside on disk. The output
is a standard C* program with the appropriate I/O and library calls added for
efficient access to out-of-core parallel variables.
Keywords: compiler, out-of-core computation,
parallel I/O, pario-bib
Abstract: We present a sort-first parallel system
for out-of-core rendering of large models on cluster-based tiled displays.
The system renders high-resolution images of large models at interactive
frame rates using off-the-shelf PCs with small memory. Given a model, we use
an out-of-core preprocessing algorithm to build an on-disk hierarchical
representation for the model. At run time, each PC renders the image for a
display tile, using an out-of-core rendering approach that employs multiple
threads to overlap rendering, visibility computation, and disk operations.
The system can operate in approximate mode for real-time rendering, or in
conservative mode for rendering with guaranteed accuracy. Running our system
in approximate mode on a cluster of 16 PCs each with 512 MB of main memory,
we are able to render 12-megapixel images of a 13-million-triangle model with
99.3% of accuracy at 10.8 frames per second. Rendering such a large model at
high resolutions and interactive frame rates would typically require
expensive high-end graphics hardware. Our results show that a cluster of
inexpensive PCs is an attractive alternative to those high-end systems.
Keywords: cluster based tiled displays,
out-of-core rendering, sort first parallel rendering, pario-bib
Keywords: cooperative caching, distributed file
system, parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of cortes:cooperative.
Keywords: parallel file system, RAID, cluster
computing, parallel I/O, pario-bib
Abstract: In this paper, we examine some of the
important problems observed in the design of cooperative caches. Solutions to
the coherence, load-balancing and fault-tolerance problems are presented.
These solutions have been implemented as a part of PAFS, a
parallel/distributed file system, and its performance has been compared to
the one achieved by xFS. Using the comparison results, we have observed that
the proposed ideas not only solve the main problems of cooperative caches,
but also increase the overall system performance. Although the solutions
presented in this paper were targeted to a parallel machine, reasonably good
results have also been obtained for networks of workstations.
Keywords: cooperative caching, distributed file
system, parallel I/O, pario-bib
Comment: They make the claim that it is better not
to replicate data into local client caches; rather, it is better to simply
make remote read and write requests to the cached block in whatever memory it
may be. That reduces the overhead (space and time) of replication and
coherency, and leads to better performance. They also present a range of
parity-based fault-tolerance mechanisms, and a load-balancing technique that
reassigns cache buffers to cache-manager processes.
Abstract: Disk arrays, or RAIDs, have become the
solution to increase the capacity and bandwidth of most storage systems, but
their usage has some limitations because all the disks in the array have to
be equal. Nowadays, assuming a homogeneous set of disks to build an array is
becoming a not very realistic assumption in many environments, especially in
low-cost clusters of workstations. It is difficult to find a disk with the
same characteristics as the ones in the array and replacing or adding new
disks breaks the homogeneity. In this paper, we propose two
block-distribution algorithms (one for RAID0 and an extension for RAID5) that
can be used to build disk arrays from a heterogeneous set of disks. We also
show that arrays using this algorithm are able to serve many more disk
requests per second than when blocks are distributed assuming that all disks
have the lowest common speed, which is the solution currently being used.
Keywords: AdaptRaid, block distribution,
heterogeneity, RAID, parallel I/O, pario-bib
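A rough illustration of the general idea of spreading blocks in proportion to each disk's capability rather than treating every disk as the slowest one (this sketch is mine, not the AdaptRaid algorithm itself; the weights are arbitrary):

    def build_pattern(weights):
        # weights[i] = blocks disk i receives per repetition of the pattern,
        # e.g. a disk twice as fast gets twice the blocks.
        pattern = []
        for disk, w in enumerate(weights):
            pattern.extend([disk] * w)
        return pattern

    def block_to_disk(block, pattern):
        return pattern[block % len(pattern)]

    pattern = build_pattern([2, 1, 1])     # one newer, faster disk and two older ones
    print([block_to_disk(b, pattern) for b in range(8)])
    # -> [0, 0, 1, 2, 0, 0, 1, 2]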
Abstract: RAIDs level 5 are one of the most widely
used kind of disk array, but their usage has some limitations because all the
disks in the array have to be equal. Nowadays, assuming a homogeneous set of
disks to build an array is becoming a not very realistic assumption in many
environments, especially in low-cost clusters of workstations. It is
difficult to find a disk with the same characteristics as the ones in the
array, and replacing or adding new disks breaks the homogeneity. In this
paper, we propose a block-distribution algorithm that can be used to build
disk arrays from a heterogeneous set of disks. We also show that arrays using
this algorithm are able to serve many more disk requests per second than when
blocks are distributed assuming that all disks have the lowest common speed,
which is the solution currently being used.
Keywords: parallel I/O, RAID, pario-bib
Comment: The web page for the project is
http://people.ac.upc.es/toni/AdaptRaid.html
Abstract: Heterogeneous disk arrays are becoming a
common configuration in many sites and specially in storage area networks
(SAN). As new disks have different characteristics than old ones, adding new
disks or replacing old ones ends up in a heterogeneous disk array. Current
solutions to this kind of arrays do not take advantage of the improved
characteristics of the new disks. In this paper, we present a
block-distribution algorithm that takes advantage of these new
characteristics and thus improves the performance and capacity of
heterogeneous disk arrays compared to current solutions.
Keywords: disk array, parallel I/O, pario-bib
Comment: The technical report associated with this
paper can be found at
ftp://ftp.ac.upc.es/pub/reports/DAC/2000/UPC-DAC-2000-76.ps.Z
Abstract: Clusters of workstations are becoming a
quite popular platform to run high-performance applications. This fact has
stressed the need of high-performance storage systems for this kind of
environments. In order to design such systems, we need adequate tools, which
should be flexible enough to model a cluster of workstations. Currently
available simulators do not allow heterogeneity (several kinds of disks),
hierarchies or resource sharing (among others), which are quite common in
clusters. To fill this gap, we have designed and implemented HRaid, which is
a very flexible and easy to use storage-system simulator. In this paper, we
present this simulator, its main abstractions and some simple examples of how
it can be used.
Keywords: simulation, RAID, disk array, storage
system, heterogeneous system, parallel I/O, pario-bib
Abstract: Summary form only given. After these two
decades, it is now a good time to go through all the work done and try to
learn the important lessons all these parallel I/O initiatives have taught
us. This paper aims at giving this global overview. The focus is not on
commercial/academic systems/prototypes, but on the concepts that lie behind
them. These concepts have normally been applied at different levels, and
thus, such an overview can be of interest to many people ranging from the
hardware design to the application implementation. Some of the most important
concepts that are discussed are, among others, data placement (RAIDs, 2D and
3D files, ...), network architectures for parallel I/O (Network attached
devices, SAN, ...), parallel caching and prefetching (cooperative caching,
Informed caching and prefetching, ...), and interfaces (collective I/O, data
distribution interfaces, ...).
Keywords: tutorial, parallel I/O overview,
pario-bib
Comment: Tutorial given at Cluster 2004.
Abstract: A new cooperative caching mechanism,
PACA, along with a caching algorithm, LRU-Interleaved, and an aggressive
prefetching algorithm, Full-File-On-Open, are presented. The caching
algorithm is especially targeted to parallel machines running a
microkernel-based operating system. It avoids the cache coherence problem
with no loss in performance. Comparing our algorithm with N-Chance
Forwarding, in the above environment, better results have been obtained by
LRU-Interleaved. We also evaluate an aggressive prefetching algorithm that
greatly increases read performance by taking advantage of the huge caches
cooperative caching offers.
Keywords: file caching, multiprocessor file
system, cooperative caching, parallel I/O, pario-bib
Comment: Contact toni@ac.upc.es. See also a
longer version of the paper, cortes:paca-tr.
Keywords: file caching, multiprocessor file
system, cooperative caching, parallel I/O, pario-bib
Comment: See cortes:paca.
Abstract: In this paper we describe PAFS, a new
parallel/distributed file system. Within the whole file system, special
interest is placed on the caching mechanism. We present a cooperative cache
that has the advantages of cooperation and avoids the problems derived from
the coherence mechanisms. Furthermore, this has been achieved with a
reasonable gain in performance. In order to show the obtained performance, we
present a comparison between PAFS and xFS (a file system that also implements
a cooperative cache).
Keywords: file caching, multiprocessor file
system, cooperative caching, cache coherence, parallel I/O, pario-bib
Comment: Contact toni@ac.upc.es.
Abstract: In this paper we present PAFS, a new
parallel/distributed file system. Within the whole file system, special
interest is placed on the caching and prefetching mechanisms. We present a
cooperative cache that avoids the coherence problem while it continues to be
highly scalable and achieves very good performance. We also present an
aggressive prefetching algorithm that allows full utilization of the big
caches offered by the cooperative cache mechanism. All the results presented
in this paper have been obtained through simulation using the Sprite workload.
Keywords: file caching, multiprocessor file
system, cooperative caching, cache coherence, parallel I/O, pario-bib
Comment: A longer, more detailed version of
cortes:pafs.
Abstract: Cooperative caches offer huge amounts of
caching memory that is not always used as well as it could be. We might find
blocks in the cache that have not been requested for many hours. These blocks
will hardly improve the performance of the system while the buffers they
occupy could be better used to speed-up the I/O operations. In this paper, we
present a family of simple prefetching algorithms that increase the
file-system performance significantly. Furthermore, we also present a way to
make any simple prefetching algorithm into an aggressive one that controls
its aggressiveness so as not to flood the cache unnecessarily. All these algorithms
and mechanisms have proven to increase the performance of two
state-of-the-art parallel/distributed file systems: PAFS and xFS.
Keywords: parallel I/O, file access pattern,
prefetching, caching, simulation, pario-bib
Comment: They present algorithms for "linear
aggressive prefetching" for systems using a cooperative cache. Two prediction
schemes are used: OBA (one block ahead) and IS_PPM (interval and size
prediction by partial match). The aggressive prefetch algorithm continuously
prefetches data until a mis-prediction occurs. When a mis-prediction occurs,
they realize that they were on the wrong path and start prefetching again
from the mis-predicted block. To limit the aggressiveness of the prefetching,
they only allow one block from each file to be prefetched at a time. If a
single application is running, this forces parallel reads to utilize only
one disk at a time. They claim, however, that when many files are being
accessed they achieve good disk utilization. They implemented the prefetching
algorithms on the xFS anderson:serverless and PAFS cortes:pafs
file systems. They used a trace-driven simulator DIMEMAS labarta:dip
to obtain performance results for portions of the CHARISMA and Sprite
workloads. The results show that using aggressive prefetching does not
usually load the system more than a system with no prefetching, and
sometimes, it even lowers the disk traffic.
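A minimal sketch of the one-block-ahead flavor with a restart on mis-prediction (my illustration; the real algorithms also integrate with the cooperative cache and throttle prefetching per file, as described above):

    class OneBlockAhead:
        """Predict the next sequential block; restart the run on a mis-prediction."""
        def __init__(self):
            self.next_expected = {}              # file -> predicted next block

        def on_request(self, file, block):
            hit = self.next_expected.get(file) == block
            # Whether or not the guess was right, resume prefetching right
            # after the block the application actually asked for.
            self.next_expected[file] = block + 1
            return hit, block + 1                # (prediction correct?, block to prefetch)

    pf = OneBlockAhead()
    for blk in [0, 1, 2, 7, 8]:                  # the jump to block 7 is a mis-prediction
        print(pf.on_request("f", blk))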
Keywords: parallel I/O, file access pattern,
prefetching, caching, pario-bib
Abstract: Redundant disk arrays are single fault
tolerant, incorporating a layer of error handling not found in nonredundant
disk systems. Recovery from these errors is complex, due in part to the large
number of erroneous states the system may reach. The established approach to
error recovery in disk systems is to transition directly from an erroneous
state to completion. This technique, known as forward error recovery, relies
upon the context in which an error occurs to determine the steps required to
reach completion, which implies forward error recovery is design specific.
Forward error recovery requires the enumeration of all erroneous states the
system may reach and the construction of a forward path from each erroneous
state. We propose a method of error recovery which does not rely upon the
enumeration of erroneous states or the context in which errors occur. When an
error is encountered, we advocate mechanized recovery to an error-free state
from which an operation may be retried. Using a form of backward error
recovery, we are able to manage the complexity of error recovery in redundant
disk arrays without sacrificing performance.
Keywords: parallel I/O, disk array, RAID,
redundancy, reliability, recovery, pario-bib
Comment: Also available in HTML format at
http://www.cs.cmu.edu/Web/Groups/PDL/HTML-Papers/CMG94/c.fm.html.
Keywords: parallel I/O, disk array, RAID,
redundancy, reliability, recovery, pario-bib
Keywords: parallel I/O, RAID, disk array,
reliability, simulation, pario-bib
Comment: See expanded version gibson:raidframe-tr.
Keywords: parallel I/O, file system, network,
pario-bib
Comment: See also coyne:storage.
Keywords: parallel I/O, file system, network,
pario-bib
Comment: See also coyne:hpss. They describe the
National Storage Laboratory at LLNL. Collaboration with many companies. The
idea is to build a combined storage system from many disk and tape components
that is networked to supercomputers. The philosophy is to separate control
and data network traffic, so that the overall control can be managed by a
(relatively) small computer, without the same computer needing to pump all of
the data through it's CPU. The data would go directly from the devices to the
client supercomputer. They also want to support multiple hierarchies of data
storage, so that new technologies can be inserted without disrupting existing
hierarchies. Access interface is layered so that high-level abstractions can
be provided as well as low-level control for those who need it.
Abstract: Grand challenge applications have to
process large amounts of data, and therefore require high-performance I/O
systems. Cluster computing is a good alternative to proprietary systems for
building a cost-effective I/O-intensive platform: some cluster architectures
have won sorting benchmarks (MinuteSort, Datamation)! Recent advances in I/O
component technologies (disk, controller and network) let us expect higher
I/O performance for data-intensive applications on clusters. The counterpart
of this evolution is that much stress is put on the different buses (memory,
I/O) of each node, which cannot be scaled. In this paper we investigate a
strategy we call READ2 (Remote Efficient Access to Distant Device) to reduce
this stress. With READ2, any cluster node accesses remote disks directly: the
remote processor and the remote memory are removed from the control and data
path, so inputs/outputs do not interfere with the host processor and host
memory activity. With the READ2 strategy, a cluster can be considered a
shared-disk architecture instead of a shared-nothing one. This paper
describes an implementation of READ2 on Myrinet networks. First experimental
results show I/O performance improvement.
Keywords: parallel I/O, pario-bib
Abstract: Rapid increases in computing and
communication performance are exacerbating the long-standing problem of
performance-limited input/output. Indeed, for many otherwise scalable
parallel applications, input/output is emerging as a major performance
bottleneck. The design of scalable input/output systems depends critically on
the input/output requirements and access patterns for this emerging class of
large-scale parallel applications. However, hard data on the behavior of such
applications is only now becoming available. In this paper, we describe the
input/output requirements of three scalable parallel applications (electron
scattering, terrain rendering, and quantum chemistry) on the Intel Paragon
XP/S. As part of an ongoing parallel input/output characterization effort, we
used instrumented versions of the application codes to capture and analyze
input/output volume, request size distributions, and temporal request
structure. Because complete traces of individual application input/output
requests were captured, in-depth, off-line analyses were possible. In
addition, we conducted informal interviews of the application developers to
understand the relation between the codes' current and desired input/output
structure. The results of our studies show a wide variety of temporal and
spatial access patterns, including highly read-intensive and write-intensive
phases, extremely large and extremely small request sizes, and both
sequential and highly irregular access patterns. We conclude with a
discussion of the broad spectrum of access patterns and their profound
implications for parallel file caching and prefetching schemes.
Keywords: file access pattern, file system
workload, workload characterization, parallel I/O, pario-bib
Comment: They use the Pablo instrumentation and
analysis tools to instrument three scalable applications that use heavy I/O:
electron scattering, terrain rendering, and quantum chemistry. They look at
the volume of data moved, the timing of I/O, and the periodic nature of I/O.
They do a little bit with the access patterns of data within each file. They
found a HUGE variation in request sizes, amount of I/O, number of files, and
so forth. Their primary conclusion is thus that file systems should be
adaptable to different access patterns, preferably under control of the
application. Note proceedings only available on CD-ROM or WWW.
Keywords: parallel I/O, out-of-core algorithm,
computational geometry, data structure, pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Keywords: parallel I/O, disk architecture, disk
array, pario-bib
Comment: Glossy from Cray describing their new
disk subsystem: up to four controllers and up to four ``drives'', each of
which actually has four spindles. Thus, a full subsystem has 16 disks. Each
drive or controller sustains 9.6 MBytes/sec, for a total of 38.4
MBytes/sec. Each drive has 4.8 GBytes, for a total of 19.2 Gbytes. Access
time per drive is 2-46.6 msec, average 24 msec. They don't say how the 4
spindles within a driver are controlled or arranged.
Keywords: parallel I/O, parallel file system,
pario-bib
Comment: Man pages for his Flex version of the file
interface. See crockett:par-files.
Keywords: parallel I/O, file access pattern,
parallel file system, pario-bib
Comment: Two views of a file: global (for
sequential programs) and internal (for parallel programs). Standardized forms
for these views, for long-lived files. Temp files have specialized forms. The
access types are sequential, partitioned, interleaved, and self-scheduled,
plus global random and partitioned random. He relates these to their best
storage patterns. No mention of prefetching. Buffer cache only needed for
direct (random) access. The application must specify the access pattern
desired.
Keywords: parallel I/O, disk array, pario-bib
Keywords: workload characterization, scientific
computing, parallel programming, message passing, pario-bib
Comment: Some mention of I/O.
Keywords: workload characterization, scientific
computing, parallel programming, message passing, pario-bib
Comment: Some mention of I/O, though only in a
limited way. Average 1207B/MFlop. Some of the applications do I/O throughout
their run (2400B/MFlop avg), while others only do I/O at the beginning or end
(14B/MFlop avg). But I/O is bursty, so larger bandwidths are suggested. The
applications are parallel programs running on Intel Delta, nCUBE/1, nCUBE/2,
and are in C, FORTRAN, or both.
Abstract: Dynamic simulations based on
time-varying inputs are extremely I/O intensive. This is shown by industrial
applications generating environmental projections based on
seasonal-to-interannual climate forecasts which have a compute to data-access
ratio of O(n) leading to significant performance degradation. Exploitation of
compression techniques such as Run-Length-Encoding (RLE) significantly
reduces the I/O bottleneck and storage requirements. Unfortunately,
traditional RLE algorithms do not perform well on a parallel-vector platform
such as the Cray architecture. This paper describes the design and
implementation of a new RLE algorithm based on data chunking and packing that
exploits the Cray gather-scatter vector hardware and multiple processors.
This innovative approach reduces I/O and file storage requirements on average
by an order of magnitude. Data intensive applications such as the integration
of environmental and global climate models now become practical in a
realistic time-frame.
Keywords: parallel I/O application, compression,
pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Keywords: parallel I/O, pario-bib
Keywords: parallel I/O, MIMD, pario-bib
Comment: Journalized version of
debenedictis:pario, debenedictis:ncube, and delrosario:nCUBE.
Keywords: parallel file system, parallel I/O,
pario-bib
Comment: Interesting paper. Describes their
mechanism for mapping I/O so that the file system knows both the mapping of a
data structure into memory and on the disks, so that it can do the
permutation and send the right data to the right disk, and back again.
Interesting Unix-compatible interface. Needs to be extended to handle complex
formats.
Keywords: parallel I/O, multiprocessor file
system, file system interface, pario-bib
Comment: Looks like they give the byte-level
mapping, then do normal reads and writes; mapping routes the data to and from
the correct place. But it does let you intermix comp with I/O. Elegant
concept. Nice interface. Works best for cases where 1) the data layout is
known in advance, 2) the data format is known, and 3) the mapping is regular
enough for easy specification. I think that irregular or unknown mappings
could still be done
with a flat mapping.
Keywords: parallel I/O, Unix, pario-bib
Comment: A more polished version of his other
papers with del Rosario. The mapping-based mechanism is released in nCUBE
software 3.0. It does support shared file pointers for self-scheduled I/O, as
well as support for variable-length records, and asynchronous I/O (although
the primary mechanism is for synchronous, i.e., SPMD, I/O). The basic idea of
scalable pipes (between programs, devices, etc.) with mappings that determine
routings to units seems like a good idea.
Abstract: The Direct Access File System (DAFS) is
a new, fast, and lightweight remote file system protocol. DAFS targets the
data center by addressing the performance and functional needs of clusters of
application servers. We call this the local file sharing environment. File
access performance is improved by utilizing Direct Access Transports, such as
InfiniBand, Remote Direct Data Placement, and the Virtual Interface
Architecture. DAFS also enhances file sharing semantics compared to prior
network file system protocols. Applications using DAFS through a user-space
I/O library can bypass operating system overhead, further improving
performance. We present performance measurements of an IP-based DAFS network,
demonstrating the DAFS protocol's lower client CPU requirements over
commodity Gigabit Ethernet. We also provide the first multiprocessor scaling
results for a well-known application (GNU gzip) converted to use DAFS.
Keywords: direct access file system, dafs, remote
dma, pario-bib
Keywords: parallel I/O, parallel file system,
pario-bib
Comment: More detail on the mapping functions, and
more flexible mapping functions (can be user specified, or some from a
library). Striped disks, parallel pipes, graphics, and HIPPI supported.
Keywords: parallel I/O, survey, pario-bib
Comment: Nice summary of grand-challenge and other
applications, and their I/O needs. Points out the need for quantitative
studies of workloads. Comments on architectures, eg, the advent of per-node
disk devices. OS problems include communication latency, data decomposition,
interface, prefetching and caching, and checkpointing. Runtime system and
compilers are important, particularly in reference to data-mapping and
re-mapping (see delrosario:two-phase). Persistent object stores and
networking are mentioned briefly.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: They show performance measurements of
various data distributions on an nCUBE and the Touchstone Delta, for reading
a matrix from a column-major file striped across disks, into some distribution
across procs. Distributions that don't match the I/O distribution are really
terrible, due to having more, smaller requests, and sometimes mismatching the
stripe size (getting seg-like contention) or block size (reading partial
blocks). They find it is better to read the file using the `best'
distribution, then to reshuffle the data in memory. Big speedups.
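The reshuffle-after-reading idea in the comment above can be sketched as
follows; this is my own minimal illustration (the function name, the simple
block layout, and the use of MPI-IO are assumptions, not taken from the paper):

```c
/* Hypothetical sketch of "read with the conforming distribution, then
 * reshuffle in memory".  Assumes an N x N column-major matrix of doubles
 * stored contiguously in one file, P processes, and P dividing N. */
#include <mpi.h>
#include <stdlib.h>

void read_then_reshuffle(const char *path, int N, double *myrows, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int cols = N / P, rows = N / P;
    double *colbuf = malloc((size_t)cols * N * sizeof(double));
    double *sendbuf = malloc((size_t)cols * N * sizeof(double));

    /* Phase 1: each process makes one large, file-order read of a block
     * of whole columns, which matches the column-major file layout. */
    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * cols * N * sizeof(double);
    MPI_File_read_at_all(fh, off, colbuf, cols * N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Phase 2: reshuffle in memory.  Process d needs rows [d*rows, (d+1)*rows)
     * of every column, so pack those pieces and exchange them. */
    for (int d = 0; d < P; d++)
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                sendbuf[((size_t)d * cols + c) * rows + r] =
                    colbuf[(size_t)c * N + (size_t)d * rows + r];
    MPI_Alltoall(sendbuf, cols * rows, MPI_DOUBLE,
                 myrows,  cols * rows, MPI_DOUBLE, comm);
    /* myrows now holds this process's row block, grouped by sending process
     * (each group column-major); a final local copy can impose any layout. */
    free(sendbuf);
    free(colbuf);
}
```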
Keywords: parallel I/O, parallel file system,
heterogeneous, pario-bib
Comment: They are planning a parallel file system
that is layered on top of standard workstation file systems, to be used by
parallel applications on heterogeneous workstation clusters. All in
user-level libraries, and on a per-application basis, application programs
can distribute their data among many files on many machines. They plan to
use a mapped interface like that of debenedictis:modular, and support
efficient collective I/O in ways reminiscent of bennett:jovian and
kotz:diskdir. Published as harry:vipfs.
Abstract: Many parallel application areas that
exploit massive parallelism, such as climate modeling, require massive
storage systems for the archival and retrieval of data sets. As such,
advances in massively parallel computation must be coupled with advances in
mass storage technology in order to satisfy I/O constraints of these
applications. We demonstrate the effects of such I/O-computation disparity
for a representative distributed information system, NASA's Earth Observing
System Distributed Information System (EOSDIS). We use performance modeling
to identify bottlenecks in EOSDIS for two representative user scenarios from
climate change research.
Keywords: climate modeling, performance modeling,
parallel I/O, pario-bib
Keywords: parallel I/O, database, GAMMA, pario-bib
Comment: Better to cite dewitt:gamma3.
Multiprocessor (VAX) DBMS on a token ring with disk at each processor. They
thought this was better than separating disks from processors by network
since then the network must handle all I/O rather than just what needs to
move. Conjecture that shared memory might be the best interconnection network.
Relations are horizontally partitioned in some way, and each processor reads
its own set and operates on them there.
Keywords: parallel I/O, database, performance
analysis, Teradata, GAMMA, pario-bib
Comment: Compared Gamma with Teradata. Various
operations on big relations. See fairly good linear speedup in many cases.
They vary only one variable at a time. Their bottleneck was at the
memory-network interface.
Keywords: parallel I/O, database, GAMMA, pario-bib
Comment: Almost identical to dewitt:gamma, with
some updates. See that for comments, but cite this one. See also
dewitt:gamma3 for a more recent paper.
Keywords: parallel I/O, database, GAMMA, pario-bib
Comment: An updated version of dewitt:gamma2, with
elements of dewitt:gamma-dbm. Really only need to cite this one. This is the
same basic idea as dewitt:gamma2, but after they ported the system from the
VAXen to an iPSC/2. Speedup results good. Question: how about comparing it to
a single-processor, single-disk system with increasing disk bandwidth? That
is, how much of their speedup comes from the increasing disk bandwidth, and
how much from the actual use of parallelism?
Keywords: database, parallel computing, parallel
I/O, pario-bib
Comment: They point out that the comments of
boral:critique - that database machines were doomed - did not really come
true. Their new thesis is that specialized hardware is not necessary and has
not been successful, but that parallel database systems are clearly
successful. In particular, they argue for shared-nothing layouts. They survey
the state-of-the-art parallel DB systems. Earlier version in Computer
Architecture News 12/90.
Keywords: parallel I/O, parallel database,
external sorting, pario-bib
Comment: Comparing exact and probabilistic
splitting for external sorting on a database. Model and experimental results
from Gamma machine. Basically, the idea is to decide on a splitting vector,
which defines $N$ buckets for an $N$-process program, and have each process
read its initial segment of the data and send each element to the appropriate
bucket (other process). All elements received are written to disks as small
sorted runs. Then each process mergesorts its runs. Probabilistic split uses
only a sample of the elements to define the vector.
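A minimal sketch of the splitting step described above, under my own
assumptions (the sample has already been gathered, keys are longs, and qsort
is good enough); this is not the Gamma code:

```c
/* Choosing splitters from a sample, as in the probabilistic variant above.
 * `sample' holds s keys drawn from across the processes; the routine sorts
 * it and picks nprocs-1 splitters so that element e is routed to the first
 * bucket i with e <= splitters[i] (the last bucket catches the rest). */
#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

void choose_splitters(long *sample, int s, long *splitters, int nprocs)
{
    qsort(sample, s, sizeof(long), cmp_long);
    for (int i = 1; i < nprocs; i++)
        splitters[i - 1] = sample[(size_t)i * s / nprocs];
}
/* Each process then reads its initial segment of the input, sends every
 * element to the process owning its bucket, writes the received elements
 * as small sorted runs, and finally merge-sorts its own runs. */
```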
Keywords: Carla, Bridge, multiprocessor file
system, Butterfly, parallel I/O, pario-bib
Comment: See also dibble:*
Keywords: parallel I/O, sorting, merging, parallel
file reference pattern, pario-bib
Comment: Based on Bridge file system (see
dibble:bridge). Parallel external merge-sort tool. Sort file on each disk,
then do a parallel merge. The merge is serialized by the token-passing
mechanism, but the I/O time dominates. The key is to keep disks busy
constantly. Uses some read-ahead, write-behind to control fluctuations in
disk request timing. An analysis of the algorithm lends insight and
matches well with the timings. Locality is a big win in Bridge tools.
Keywords: parallel I/O, external sorting, merging,
parallel file reference pattern, pario-bib
Comment: Subset of dibble:sort. Extra comments to
distinguish from striping and RAID work. Good point that those projects are
addressing a different bottleneck, and that they can provide essentially
unlimited bandwidth to a single processor. Bridge could use those as
individual file systems, parallelizing the overall file system, avoiding the
software bottleneck. Using a very-reliable RAID at each node in Bridge could
safeguard Bridge against failure for reasonable periods, removing reliability
from Bridge level.
Keywords: parallel I/O, external sorting, merging,
parallel file system, pario-bib
Comment: Also TR 334. Mostly covered by other
papers, but includes good introduction, discussion of reliability and
maintenance issues, and implementation. Short mention of prefetching implied
that simple OBL was counter-productive, but later tool-specific buffering
with read-ahead was often important. The three interfaces to the PIFS server
are interesting. A fourth compromise might help make tools easier to write.
Abstract: In this paper, we evaluate the impact on
performance of various implementation techniques for collective I/O
operations, and we do so across four important parallel architectures. We
show that a naive implementation of collective I/O does not result in
significant performance gains for any of the architectures, but that an
optimized implementation does provide excellent performance across all of the
platforms under study. Furthermore, we demonstrate that there exists a single
implementation strategy that provides the best performance for all four
computational platforms. Next, we evaluate implementation techniques for
thread-based collective I/O operations. We show that the most obvious
implementation technique, which is to spawn a thread to execute the whole
collective I/O operation in the background, frequently provides the worst
performance, often performing much worse than just executing the collective
I/O routine entirely in the foreground. To improve performance, we explore an
alternate approach where part of the collective I/O operation is performed in
the background, and part is performed in the foreground. We demonstrate that
this implementation technique can provide significant performance gains,
offering up to a 50% improvement over implementations that do not attempt to
overlap collective I/O and computation.
Keywords: parallel I/O, collective I/O, pario-bib,
parallel architecture
Abstract: Java is quickly becoming the preferred
language for writing distributed applications because of its inherent support
for programming on distributed platforms. In particular, Java provides
compile-time and run-time security, automatic garbage collection, inherent
support for multithreading, support for persistent objects and object
migration, and portability. Given these significant advantages of Java, there
is a growing interest in using Java for high-performance computing
applications. To be successful in the high-performance computing domain,
however, Java must have the capability to efficiently handle the significant
I/O requirements commonly found in high-performance computing applications.
While there has been significant research in high-performance I/O using
languages such as C, C++, and Fortran, there has been relatively little
research into the I/O capabilities of Java. In this paper, we evaluate the
I/O capabilities of Java for high-performance computing. We examine several
approaches that attempt to provide high-performance I/O (many of which are
not obvious at first glance) and investigate their performance in both
parallel and multithreaded environments. We also provide suggestions for
expanding the I/O capabilities of Java to better support the needs of
high-performance computing applications.
Keywords: parallel I/O, Java, pario-bib
Abstract: Massively parallel computers are
increasingly being used to solve large, I/O intensive applications in many
different fields. For such applications, the I/O requirements quite often
present a significant obstacle in the way of achieving good performance, and
an important area of current research is the development of techniques by
which these costs can be reduced. One such approach is collective I/O,
where the processors cooperatively develop an I/O strategy that reduces the
number, and increases the size, of I/O requests, making much better use of
the I/O subsystem. Collective I/O has been shown to significantly reduce the
cost of performing I/O in many large, parallel applications, and for this
reason serves as an important base upon which we can explore other mechanisms
which can further reduce these costs. One promising approach is to use
threads to perform the collective I/O in the background while the main
thread continues with other computation in the foreground. In this
paper, we explore the issues associated with implementing collective I/O in
the background using threads. The most natural approach is to simply spawn
off an I/O thread to perform the collective I/O in the background while the
main thread continues with other computation. However, our research
demonstrates that this approach is frequently the worst implementation
option, often performing much more poorly than just executing collective I/O
completely in the foreground. To improve the performance of thread-based
collective I/O, we developed an alternate approach where part of the
collective I/O operation is performed in the background, and part is
performed in the foreground. We demonstrate that this new technique can
significantly improve the performance of thread-based collective I/O,
providing up to an 80% improvement over sequential collective I/O (where
there is no attempt to overlap computation with I/O). Also, we discuss one
very important application of this research, which is the implementation of
the split-collective parallel I/O operations defined in MPI 2.0.
Keywords: parallel I/O, multithread programming,
collective I/O, disk-directed I/O, two-phase I/O, pario-bib
Comment: They examine an implementation of
collective I/O in MPI2 such that the collective I/O is done in the
background, using a thread, while the computation continues. They found that
the performance can be quite disappointing, because of the competition for
the CPU between the computational thread and the background thread executing
the redistribution phase of the I/O operation. They get better results by
doing the redistribution in the foreground, making the computation wait, and
then doing the I/O in the background thread while the computation continues.
Results from four major parallel platforms, but only for write operations.
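The strategy the comment describes (do the CPU-hungry redistribution in the
foreground, then overlap only the file transfer with computation) might look
roughly like the sketch below; the job structure, the pwrite loop, and the
function names are my assumptions, not the authors' implementation:

```c
/* Redistribute in the foreground, write in the background. */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>
#include <stddef.h>

struct write_job {
    int fd;            /* file already open for writing */
    const char *buf;   /* data already redistributed into file order */
    size_t len;
    off_t off;
};

static void *background_write(void *arg)
{
    struct write_job *job = arg;
    size_t done = 0;
    while (done < job->len) {
        ssize_t n = pwrite(job->fd, job->buf + done, job->len - done,
                           job->off + (off_t)done);
        if (n <= 0)
            break;                 /* error handling elided in this sketch */
        done += (size_t)n;
    }
    return NULL;
}

void collective_write_overlapped(struct write_job *job, pthread_t *tid)
{
    /* The communication-heavy redistribution runs here, in the foreground,
     * so it does not compete with the compute thread for the CPU ... */
    pthread_create(tid, NULL, background_write, job);
    /* ... and only the disk transfer overlaps the next computation phase.
     * Call pthread_join(*tid, NULL) before reusing the buffer. */
}
```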
Keywords: parallel I/O, neural network, pario-bib
Comment: An application that reads large files,
sequentially, on CM2 with DataVault.
Keywords: scientific application, parallel I/O,
ocean modeling, climate modeling, pario-bib
Comment: They describe the approaches taken to
optimize an out-of-core parallel ocean model simulation on parallel
distributed-memory machines. The original code used fixed size memory windows
to store the in-core portions of the dataset on the machine. The code used the
same approach for machines that had enough memory to store the entire data
set in-core, except rather than reading and writing to disk, the code copied
to/from ramdisk (very copy intensive). The new code added an option to allow
the entire dataset to be run in-core. Another place where the code could be
optimized was in the writing of the dataset. For computational efficiency,
the data was stored in memory as an array U(ix,iz,iy), but other applications
needed the data stored on disk as U(ix,iy,iz). To optimize the I/O, the new
code allocated additional processors to gather and re-organize and write the
data to disk (much like Salvo).
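A toy illustration of the reordering step described above, assuming a dense
double-precision array and a single writer; this is not the ocean-model code,
and the index convention (ix fastest both in memory and on disk) is an
assumption:

```c
/* Reorder U from memory order U(ix,iz,iy) (ix fastest, then iz, then iy) to
 * disk order U(ix,iy,iz), one z-plane at a time, then write the plane. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void write_reordered(FILE *f, const double *u, int nx, int ny, int nz)
{
    double *plane = malloc((size_t)nx * ny * sizeof(double));
    for (int iz = 0; iz < nz; iz++) {
        for (int iy = 0; iy < ny; iy++)
            memcpy(plane + (size_t)iy * nx,           /* disk: (ix, iy) within plane iz */
                   u + ((size_t)iy * nz + iz) * nx,   /* memory: the ix run for (iz, iy) */
                   (size_t)nx * sizeof(double));
        fwrite(plane, sizeof(double), (size_t)nx * ny, f);
    }
    free(plane);
}
```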
Keywords: RAID, disk array, network file system,
parallel I/O, pario-bib
Comment: See also chen:raid2. The only significant
addition in this paper is a discussion of the performance of the RAID-II
running an LFS file system.
Keywords: parallel I/O, pario-bib
Comment: RAID-3 striping across drives in a tape
robot, using 3 data plus one parity. Tape-switch time is very high, ie, 4
minutes. Switching four tapes at the same time would only get a little
overlap, because there is only one robot arm. Assume large request size.
Striping is much faster when only one request is considered, but with many
requests outstanding, performance goes way down due to limited concurrency.
More readers with the same stripe group size alleviate the contention and
allow concurrency. Faster readers are the most important way to improve
performance, more important than improving robot speed. As both speeds
improve the benefit of striping diminishes. Seems like this could be
expressed in a simple equation...
Keywords: intel, ncube, hypercube, multiprocessor
architecture, performance, parallel I/O, pario-bib
Comment: An excellent paper presenting lots of
detailed performance measurements on the iPSC/1, iPSC/2, iPSC/860, nCUBE
3200, and nCUBE 6400: arithmetic, FLOPS, communication, I/O. Tables of
numbers provide details needed for simulation. iPSC/860 definitely is
fastest, but way out of balance wrt communication vs. computation. Number of
message hops is not so important in newer machines.
Keywords: parallel I/O, scheduling, pario-bib
Comment: They note that the set of data transfers
in a parallel I/O architecture can be expressed as a graph coloring problem.
Realistically, a centralized solution is not possible because the information
is inherently distributed. So they develop some distributed algorithms and
experimentally compare them to the centralized algorithm. They get within 5%
and do better than earlier algorithms.
Abstract: A growing imbalance in CPU (central
processing unit) and I/O (input/output) speeds has led to a communications
bottleneck in distributed architectures, especially for data intensive
applications such as multimedia information systems, databases, and grand
challenge problems. Our solution is to schedule parallel I/O operations
explicitly. We present a class of decentralized scheduling algorithms that
eliminate contention for I/O ports while maintaining an efficient use of
bandwidth. These algorithms, based on edge coloring and matching of bipartite
graphs, rely upon simple heuristics to obtain shorter schedules. We use
simulation to evaluate the ability of our algorithms to obtain near-optimal
solutions in a distributed context, and compare our work with that of other
researchers. Our results show that our algorithms produce schedules within 5%
of the optimal schedule, a substantial improvement over existing algorithms.
Keywords: randomized edge coloring, scheduling
algorithms, bipartite graphs, parallel I/O, pario-bib
Keywords: parallel I/O algorithms, pario-bib
Comment: They devise some decentralized algorithms
to generate schedules for data transfers between a set of clients and a set
of servers when the complete set of transfers is known in advance, and the
clients and servers are fairly tightly synchronized. They concentrate on the
limitation that clients and servers may each only participate in one transfer
at any given moment; interconnect bandwidth is not an issue. Their
simulations show that their algorithms come within 20% of optimal.
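For intuition, here is a minimal centralized sketch of the edge-coloring view
of the problem: each transfer is an edge between a client and a server, and
all edges of one color form a conflict-free round. The greedy pass below is
my own illustration; the papers' decentralized heuristics are more involved:

```c
/* Greedy edge coloring of the transfer graph: transfer t is an edge between
 * client[t] and server[t]; round[t] gets the round (color) assigned to it.
 * Returns the schedule length.  At most nt rounds are ever needed. */
#include <stdlib.h>

int greedy_schedule(const int *client, const int *server, int nt,
                    int nclients, int nservers, int *round)
{
    char *cbusy = calloc((size_t)nclients * nt, 1);  /* client c busy in round r? */
    char *sbusy = calloc((size_t)nservers * nt, 1);  /* server s busy in round r? */
    int rounds = 0;

    for (int t = 0; t < nt; t++) {
        int r = 0;
        while (cbusy[(size_t)client[t] * nt + r] || sbusy[(size_t)server[t] * nt + r])
            r++;                             /* smallest round free at both endpoints */
        round[t] = r;
        cbusy[(size_t)client[t] * nt + r] = 1;
        sbusy[(size_t)server[t] * nt + r] = 1;
        if (r + 1 > rounds)
            rounds = r + 1;
    }
    free(cbusy);
    free(sbusy);
    return rounds;
}
```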
Abstract: The cost of data transfers, and in
particular of I/O operations, is a growing problem in parallel computing.
This performance bottleneck is especially severe for data-intensive
applications such as multimedia information systems, databases, and Grand
Challenge problems. A promising approach to alleviating this bottleneck is to
schedule parallel I/O operations explicitly. Although centralized
algorithms for batch scheduling of parallel I/O operations have previously
been developed, they are not appropriate for all applications and
architectures. We develop a class of decentralized algorithms for scheduling
parallel I/O operations, where the objective is to reduce the time required
to complete a given set of transfers. These algorithms, based on
edge-coloring and matching of bipartite graphs, rely upon simple heuristics
to obtain shorter schedules. We present simulation results indicating that
the best of our algorithms can produce schedules whose length (or makespan)
is within 2-20% of the optimal schedule, a substantial improvement on
previous decentralized algorithms. We discuss theoretical and experimental
work in progress and possible extensions.
Keywords: parallel I/O, distributed scheduling
algorithm, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel computer architecture, MIMD,
pario-bib
Comment: Basically the same architecture as the
nCUBE/2, scaled up. Eight to 65K processors, each 50 MIPS and 100 DP MFLOPS,
initially 50 MHz. RISC. 16 hypercube channels and 2 I/O channels per
processor. CPU chip includes MMU, TLB, I- and D-cache, hypercube and I/O
channels, and memory interface. The channels have DMA support built-in (5
usec startup overhead, worst-case end-to-end latency 10 usec), and can talk
directly to the memory interface or to the cache. 64-bit virtual address
space, with 48 bits implemented. Hardware support for distributed virtual
memory. Separate 16-node hypercube is used for I/O processing, with up to 400
disks attached. Packaging includes multi-chip module with DRAMs stacked
directly on the CPU chip, fluid-cooled, so that an entire node is one
package, with the 18 network links as essentially its only external
connections.
Keywords: parallel I/O, disk caching, parallel
file system, log-structured file system, Intel iPSC/2, pario-bib
Comment: Essentially a small literature survey. No
new ideas here, but it is a reasonable overview of the situation. Mentions
caching, striping, disk layout optimization, log-structured file systems, and
Bridge and Intel CFS. Plugs their ``Swift'' architecture (see cabrera:pario).
Keywords: parallel I/O, parallel architecture,
performance evaluation, pario-bib
Comment: See el-ghazawi:mpio.
Keywords: parallel I/O, parallel architecture,
performance evaluation, pario-bib
Comment: See el-ghazawi:mp1.
Keywords: parallel file system, parallel I/O,
pario-bib
Comment: See also elford:ppfs-tr, huber:ppfs.
Keywords: parallel file system, parallel I/O,
pario-bib
Comment: See also elford:ppfs-detail,
huber:ppfs-scenarios, huber:ppfs.
Keywords: trends, disk technology, disk array,
parallel I/O, pario-bib
Keywords: Carla, multiprocessor file system,
Bridge, Butterfly, parallel I/O, pario-bib
Abstract: The problem of providing file I/O to
parallel programs has been largely neglected in the development of
multiprocessor systems. There are two essential elements of any file system
design intended for a highly parallel environment: parallel I/O and effective
caching schemes. This paper concentrates on the second aspect of file system
design and, specifically, on the question of whether prefetching blocks of the
file into the block cache can effectively reduce overall execution time of a
parallel computation, even under favorable assumptions. Experiments have
been conducted with an interleaved file system testbed on the Butterfly Plus
multiprocessor. Results of these experiments suggest that 1) the hit ratio,
the accepted measure in traditional caching studies, may not be an adequate
measure of performance when the workload consists of parallel computations
and parallel file access patterns, 2) caching with prefetching can
significantly improve the hit ratio and the average time to perform an I/O
operation, and 3) an improvement in overall execution time has been observed
in most cases. In spite of these gains, prefetching sometimes results in
increased execution times (a negative result, given the optimistic nature of
the study). We explore why it is not trivial to translate savings on
individual I/O requests into consistently better overall performance and
identify the key problems that need to be addressed in order to improve the
potential of prefetching techniques in this environment.
Keywords: dfk, parallel file system, prefetching,
disk caching, MIMD, parallel I/O, pario-bib
Abstract: NonStop SQL is an implementation of
ANSI/ISO SQL on Tandem Computer Systems. In its second release, NonStop SQL
transparently and automatically implements parallelism within an SQL
statement by exploiting Tandem's multiprocessor architecture. For basic
queries on a uniform database, it achieves performance that is near-linear
with respect to the number of processors and disks used. The authors describe
benchmarks demonstrating these results and the technology used to achieve
them.
Keywords: parallel database, parallel
architecture, parallel I/O, pario-bib
Comment: They (briefly) describe the Tandem
NonStop system, including their disk nodes (which contain CPU, memory, and
disk) and their use. A query involves sending a request to all the disk
nodes, which independently read the appropriate data from their local disk,
filter out all the interesting records, and send only those interesting
records to the originator for processing. This is an early example of smart
(programmable) I/O nodes.
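A rough sketch of that smart-I/O-node query pattern, under my own assumptions
(fixed-length records, MPI for transport, and rank 0 as the query originator);
it is meant only to illustrate the idea of filtering at the disk node:

```c
/* Each disk node scans its own horizontal partition, applies the selection
 * predicate locally, and ships only matching records to the originator. */
#include <mpi.h>
#include <stdio.h>

#define RECLEN 128   /* assumed fixed record length */

void scan_and_filter(const char *local_partition,
                     int (*match)(const char *rec), MPI_Comm comm)
{
    char rec[RECLEN];
    FILE *f = fopen(local_partition, "rb");
    while (f && fread(rec, 1, RECLEN, f) == RECLEN)
        if (match(rec))                              /* filter at the disk node */
            MPI_Send(rec, RECLEN, MPI_CHAR, 0, 1, comm);
    if (f)
        fclose(f);
    MPI_Send(rec, 0, MPI_CHAR, 0, 2, comm);          /* tag 2: this node is done */
}
```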
Keywords: multiprocessor architecture, pario-bib
Comment: A nice summary of the Paragon
architecture and OS. Some information that is not found in Intel's technical
summary, and with much less marketing hype. But, it was written in April 1993
with a look to the future, so it may represent things that are not ready yet.
Network interface allows user-mode msgs, DMA direct to user space if receive
has been posted; else there is a new queue for every possible sending
processor. They plan to expand the nodes to 4-processors and 64-128 MB. PFS
stripes across RAIDs. Now SCSI-1 with 5 MB/s, later 10 MB/s SCSI-2, then 20
MB/s fast SCSI-2. See also intel:paragon.
Abstract: As network latency drops below disk
latency, access time to a remote disk will begin to approach local disk
access time. The performance of I/O may then be improved by spreading disk
pages across several remote disk servers and accessing disk pages in
parallel. To investigate this, we have prototyped a data page server called a
Page File. This persistent data type provides a set of methods to access disk
pages stored on a cluster of remote machines acting as disk servers. The goal
is to improve the throughput of a database management system or other
I/O-intensive application by accessing pages from remote disks and incurring disk
latency in parallel. This report describes the conceptual foundation and the
methods of access for our prototype.
Keywords: parallel I/O, network, virtual memory,
parallel database, pario-bib
Comment: An early document on a system under
development. It declusters pages of a file across many page servers, and
provides an abstraction of a linearly ordered collection of pages. The
intended use is by database systems. As it stands now, there is little here
other than block declustering, and thus, nothing new to the I/O community.
Perhaps later they will develop interesting new caching or prefetching
strategies.
Abstract: Remotely sensed imagery has been used
for developing and validating various studies regarding land cover dynamics.
However, the large amounts of imagery collected by the satellites are largely
contaminated by the effects of atmospheric particles. The objective of
atmospheric correction is to retrieve the surface reflectance from remotely
sensed imagery by removing the atmospheric effects. We introduce a number of
computational techniques that lead to a substantial speedup of an atmospheric
correction algorithm based on using look-up tables. Excluding I/O time, the
previously known implementation processes one pixel at a time and requires
about 2.63 seconds per pixel on a SPARC-10 machine, while our implementation
is based on processing the whole image and takes about 4-20 microseconds per
pixel on the same machine. We also develop a parallel version of our
algorithm that is scalable in terms of both computation and I/O. Experimental
results obtained show that a Thematic Mapper (TM) image (36 MB per band, 5
bands need to be corrected) can be handled in less than 4.3 minutes on a
32-node CM-5 machine, including I/O time.
Keywords: remote sensing, parallel I/O
application, pario-bib
Comment: Note proceedings only on CD-ROM or WWW.
Keywords: multiprocessor file system, parallel
I/O, Vesta, pario-bib
Comment: Part of jin:io-book. Excellent survey.
Reformatted version of feitelson:pario.
Abstract: Applications executing on massively
parallel supercomputers require a high aggregate bandwidth of I/O with low
latency. This requirement cannot be satisfied by an external file server.
One solution is to employ an internal parallel I/O subsystem, in which I/O
nodes with DASD are linked to the same interconnection network that connects
the compute nodes. The option of increasing the number of I/O nodes together
with the number of compute nodes allows for a balanced architecture. Indeed,
most multicomputer vendors provide internal parallel I/O subsystems as part
of their product offerings. However, these systems typically attempt to
preserve a Unix-compatible interface, hiding or abstracting the parallelism.
New interfaces may be required to fully utilize the capabilities of parallel
I/O.
Keywords: multiprocessor file system, parallel
I/O, Vesta, pario-bib
Comment: A very nice survey of multiprocessor file
systems issues. Published version of feitelson:pario-tr.
Keywords: multiprocessor file system, parallel
I/O, Vesta, pario-bib, OS94W
Comment: A very nice survey of multiprocessor file
systems issues. They make a good point that I/O needs would increase
if I/O capabilities increase, because people would output more iterations,
more complete data sets, etc. They make the case for internal file systems,
the use of dedicated I/O nodes, the attachment of every RAID to two I/O nodes
for reliability, the Vesta interface, and user control over the view of a
parallel file. See also corbett:vesta*. Published as feitelson:pario.
Keywords: parallel I/O, pario-bib
Comment: How to deal with stdin/stdout on a
parallel processor. Basically, each task is given its own window, where the
user can see the output and type input to that task. Then, they have a window
for LEDs, ie, little squares, one for each task. The square changes color
depending on the situation. The default is to turn green when output is
available, red when waiting for input, and white when the window is currently
open. Clicking on these opens the appropriate window, so there is some
control over which windows you are watching. They also provide a programmer
interface to allow the programmer to control the LED color.
Keywords: parallel I/O, multiprocessor file
system, Vesta, pario-bib
Comment: See feitelson:vesta-perf-tr.
Keywords: parallel I/O, multiprocessor file
system, Vesta, pario-bib
Comment: Cite feitelson:vesta-perf. A good
performance study of Vesta running on an SP-1. See corbett:jvesta for
ultimate reference. In all, Vesta performed very well both for single-node
and multiple-node performance. I wish that they had tried some very small
BSUs; at one point they tried 16-byte BSUs and the performance looked very
poor. Section on I/O vectors is confusing.
Keywords: parallel I/O, parallel I/O interface,
pario-bib
Comment: Part of jin:io-book.
Abstract: A fault tolerant parallel virtual file
system is designed and implemented to provide high I/O performance and high
reliability. A queuing model is used to analyze in detail the average
response time when multiple clients access the system. The results show that
I/O response time is a function of several operational parameters. It
decreases with increases in the I/O buffer hit rate for read requests, the
write buffer size for write requests, and the number of server nodes in the
parallel file system, while a higher I/O request arrival rate increases I/O
response time.
Keywords: fault-tolerance, PVFS, performance
analysis, pario-bib
Abstract: Without any additional cost, all the
disks on the nodes of a cluster can be connected together through CEFT-PVFS,
a RAID-10-style parallel file system, to provide multi-GB/s parallel I/O
performance. I/O response time is one of the most important measures of
quality of service for a client. When multiple clients submit data-intensive
jobs at the same time, the response time experienced by the user is an
indicator of the power of the cluster. In this paper, a queuing model is used
to analyze in detail the average response time when multiple clients access
CEFT-PVFS. The results reveal that response time is a function of
several operational parameters: I/O response time decreases with increases in
the I/O buffer hit rate for read requests, the write buffer size for write
requests, and the number of server nodes in the parallel file system, while
the higher the I/O request arrival rate, the longer the
I/O response time. On the other hand, the collective power of a large cluster
supported by CEFT-PVFS is shown to be able to sustain a steady and stable I/O
response time for a relatively large range of the request arrival rate.
Keywords: PVFS, parallel I/O, I/O response time,
pario-bib
Abstract: In a previous work [Ferragina-Grossi,
ACM STOC 95], we proposed a text indexing data structure for secondary
storage, which we called SB-tree, that combines the best of B-trees and
suffix arrays, overcoming the limitations of inverted files, suffix arrays,
suffix trees, and prefix B-trees. In this paper we study the performance of
SB-trees in a practical setting, performing a set of searching and updating
experiments. Improved performance was obtained by a new space efficient and
alphabet-independent organization of the internal nodes of the SB-tree, and a
new batch insertion procedure that avoids thrashing.
Keywords: out-of-core algorithm, parallel I/O,
pario-bib
Keywords: out-of-core algorithm, parallel I/O,
pario-bib
Abstract: Processing and analyzing large volumes
of data plays an increasingly important role in many domains of scientific
research. High-level language and compiler support for developing
applications that analyze and process such datasets has, however, been
lacking so far. In this paper, we present a set of language extensions
and a prototype compiler for supporting high-level object-oriented
programming of data intensive reduction operations over multidimensional
data. We have chosen a dialect of Java with data-parallel extensions for
specifying a collection of objects, a parallel for loop, and reduction
variables as our source high-level language. Our compiler analyzes parallel
loops and optimizes the processing of datasets through the use of an existing
run-time system, called the Active Data Repository (ADR). We show how loop
fission followed by interprocedural static program slicing can be used by the
compiler to extract required information for the run-time system. We present
the design of a compiler/run-time interface which allows the compiler to
effectively utilize the existing run-time system. A prototype compiler
incorporating these techniques has been developed using the Titanium
front-end from Berkeley. We have evaluated this compiler by comparing the
performance of compiler-generated code with hand-customized ADR code for
three templates, from the areas of digital microscopy and scientific
simulations. Our experimental results show that the performance of the
compiler-generated versions is, on average, 21% lower than, and in all cases
within a factor of two of, the performance of the hand-coded versions.
Keywords: parallel I/O, parallel applications,
data parallel, pario-bib
Keywords: pario-bib, application
Comment: Best Application Paper award. This
paper describes a client/server application that emulates a high-power light
microscope. They use wavelet compression to reduce the size of each of the
electronic slides, and they use a parallel data server much like the ones used
for satellite image data (see chang:titan) to service data requests.
Keywords: parallel I/O, multiprocessor file
system, benchmark, pario-bib
Comment: See also carter:benchmark. Some
preliminary results from one of their benchmarks. Note: ``I was only using a
single Cray disk with a maximum transfer rate of 9.6MBytes/sec.'' -
Fineberg.
Abstract: MPI-IO provides a demonstrably efficient
portable parallel Input/Output interface, compatible with the MPI standard.
PMPIO is a "reference implementation" of MPI-IO, developed at NASA Ames
Research Center. To date, PMPIO has been ported to the IBM SP-2, SGI and Sun
shared memory workstations, the Intel Paragon, and the Cray J90. Preliminary
results using the PMPIO implementation of MPI-IO show an improvement of as
much as a factor of 20 on the NAS BTIO benchmark compared to a Fortran based
implementation. We show comparative results on the SP-2, Paragon, and SGI
architectures.
Keywords: parallel I/O, pario-bib
Keywords: parallel I/O, hypercube, parallel file
system, pario-bib
Comment: For hypercube-like architectures.
Interleaved files, though flexible. Separate network for I/O, maybe not
hypercube. I/O is blocked and buffered - no coherency or prefetching issues
discussed. Buffered close to point of use. Parallel access is ok. Broadcast
supported? I/O nodes distinguished from comp nodes. I/O hooked to front-end
too. See hadimioglu:fs and hadimioglu:hyperfs
Abstract: Increased computer networking has
sparked a resurgence of the `on-line' revolution of the 1970's, making ever
larger amounts of data available on a world wide basis and placing greater
demands on the performance and availability of tertiary storage systems. In
this paper, we argue for a new approach to tertiary storage system
architecture that is obtained by coupling multiple small and inexpensive
`building block' libraries (or jukeboxes) together to create larger tertiary
storage systems. We call the resulting system a RAIL and show that it has
performance and availability characteristics superior to conventional
tertiary storage systems, for almost the same dollar/megabyte cost. A RAIL
system is the tertiary storage equivalent of a fixed magnetic disk RAID
storage system, but with several additional features that enable the ideas of
data striping and redundancy to function efficiently on dismountable media
and robotic media mounting systems. We present the architecture of such a
system called Starfish I and describe the implementation of a prototype. We
also introduce the idea of creating a log-structured library array (LSLA) on
top of a RAIL architecture (StarFish II) and show how it can have write
performance equivalent to that of secondary storage, and improved read
performance along with other advantages such as easier compression and the
elimination of the 4*RAID/RAIL write penalty.
Keywords: parallel I/O, redundant data, striping,
tertiary storage, pario-bib
Comment: Part of a special issue.
Keywords: parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of nieplocha:arrays.
Keywords: computational science, chemistry,
parallel I/O, pario-bib
Comment: A library package for computational
chemistry programs. It supports out-of-core arrays. See also
nieplocha:chemio.
Keywords: parallel I/O, parallel database,
multiprocessor file system, climate model, grand challenge, tertiary storage,
archival storage, RAID, tape robot, pario-bib
Comment: Includes the slides from many presenters
covering climate modeling, data requirements for climate models, archival
storage systems, multiprocessor file systems, and so forth. NCAR data storage
growth rates (p. 54), 500 bytes per MFlop, or about 8 TB/year with Y/MP-8.
Average file length 26.2 MB. Migration across both storage hierarchy and
generations of media. LLNL researcher: typical 50-year, 3-dimensional model
with 5-degree resolution will produce 75 GB of output. Attendee list
included.
Abstract: As high-speed networks make it easier to
use distributed resources, it becomes increasingly common that applications
and their data are not colocated. Users have traditionally addressed this
problem by manually staging data to and from remote computers. We argue
instead for a remote I/O paradigm in which programs use familiar parallel I/O
interfaces to access remote filesystems. In addition to simplifying remote
execution, remote I/O can improve performance relative to staging by
overlapping computation and data transfer or by reducing communication
requirements. However, remote I/O also introduces new technical challenges in
the areas of portability, performance, and integration with distributed
computing systems. We propose techniques designed to address these challenges
and describe a remote I/O library called RIO that we are developing to
evaluate the effectiveness of these techniques. RIO addresses issues of
portability by adopting the quasi-standard MPI-IO interface and by defining a
RIO device and RIO server within the ADIO abstract I/O device architecture.
It addresses performance issues by providing traditional I/O optimizations
such as asynchronous operations and through implementation techniques such as
buffering and message forwarding to offload communication overheads.
Microbenchmarks and application experiments demonstrate that our techniques
can improve turnaround time relative to staging.
Keywords: parallel I/O, distributed file system,
pario-bib
Comment: They want to support users that have
datasets at different locations in the Internet, but need to access the data
at supercomputer parallel machines. Rather than staging data in and out, they
want to provide remote access. Issues: naming, dynamic loads, heterogeneity,
security, fault-tolerance. All traffic goes through a 'forwarder node' that
funnels all the traffic into the network. They use URLs for pathnames (e.g.,
"x-rio://..."). They find that non-blocking ops are important, as is
collective I/O. They think that buffering will be important. Limited
experiments.
Keywords: hypercube, pario-bib
Comment: See fox:cubix for parallel I/O.
Keywords: parallel file system, hypercube,
pario-bib
Comment: Parallel I/O control, called CUBIX.
Interesting method. Depends a lot on ``loose synchronization'', which is
sort of SIMD-like.
Abstract: Between 1994 and 1997, researchers at
Southwest Research Institute (SwRI) investigated methods for distributing
parallel computation and data visualization under the support of an
internally funded Research Initiative Program entitled the Advanced
Visualization Technology Project (AVTP). A hierarchical data cache
architecture was developed to provide a flexible interface between the
modeling or simulation computational processes and data visualization
programs. Compared to conventional post facto data visualization approaches,
this data cache structure provides many advantages including simultaneous
data access by multiple visualization clients, comparison of experimental and
simulated data, and visual analysis of computer simulation as computation
proceeds. However, since the data cache was resident on a single
workstation, this approach did not address the issue of scalability of
methods for avoiding the data storage bottleneck by distributing the data
across multiple networked workstations. Scalability through distributed
database approaches is being investigated as part of the Applied
Visualization using Advanced Network Technology Infrastructure (AVANTI)
project. This paper describes a methodology currently under development
that is intended to avoid bottlenecks that typically arise as the result of
data consumers (e.g. visualization applications) that must access and process
large amounts of data that has been generated and resides on other hosts, and
which must pass through a central data cache prior to being used by the data
consumer. The methodology is based on a fundamental paradigm that the end
result (visualization) rendered by a data consumer can, in many cases, be
produced using a reduced data set that has been distilled or filtered from
the original data set. In the most basic case, the filtered data used as
input to the data consumer may simply be a proper subset of massive data sets
that have been distributed among hosts. For the general case, however, the
filtered data may bear no resemblance to the original data since it is the
result of processing the raw data set and distilling it to its visual
"essence", i.e. the minimal data set that is absolutely required by the data
consumer in order to perform the required rendering function. Data
distribution bottlenecks for visualization applications are thereby reduced
by avoiding the transfer of large amounts of raw data in favor of
considerably distilled visual data. There are, of course, computational
costs associated with this approach since raw data must be processed into its
visual essence, but these computational costs may be distributed among
multiple processors. It should be realized, however, that, in general, these
computational costs would exist anyway since, for the visualization to be
performed, there must be a transformation between the raw data and the
visualization primitives (e.g. line segments, polygon vertices, etc.) to be
rendered. The main principle put forth by this paper is that if data
distribution bottlenecks are to be minimized, the amount of raw data
transferred should be reduced by employing data filtering processes that can
be distributed among multiple hosts. The complete paper demonstrates,
both analytically and experimentally, that this approach becomes increasingly
effective (scalable) as the computational expense associated with the data
filtering transformation rises.
Keywords: distributed computing, filters, grid,
input/output, parallel I/O, pario-bib, app-pario
Comment: The goal of their work is to improve the
performance of data visualization applications in which the data generators
(a disk or a running application) are remote from the data consumers (the
visualization stations). They deal with network
bottlenecks by using a distributed, redundant data cache to hold intermediate
data between the data generator and the data consumer. They also reduce
network traffic by applying data filters to the data at the distributed cache
processors. The main argument is that since the data must be filtered before
it is visualized, it makes more sense to perform the filtering at the data
cache, so that the computation can be distributed and the amount of data that
must be transferred to the data consumer is reduced.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: The complete paper on the SPIFFI parallel
file system. It seems to be much like Intel CFS from the programmer's point
of view, with a few new file modes, user-selectable striping granularity.
Their Paragon, though a source of problems, had a disk on every node (though
they do not take advantage of that in this work). They have a buffer pool on
each I/O node, which does prefetching in a somewhat novel way.
Keywords: parallel file system, multimedia, video
server, pario-bib
Comment: See also freedman:spiffi. They simulate
their video-on-demand server. Their model is a cluster of workstation
servers, connected by a network to video-display terminals. The terminals
just have a circular buffer queue that they fill by making requests to the
server, and drain by uncompressing MPEG and displaying video. The servers
manage a buffer pool and a set of striped disks. All videos are striped
across all disks. They use dual LRU lists in the server buffer pool: one for
used blocks, and one for prefetched blocks (``love prefetching''). They use a
``real-time'' disk scheduling algorithm that prioritizes requests by their
deadlines (or anticipated deadline in case of a prefetch). Their metric is
maximum number of terminals that can be supported without glitches. They plan
to implement their system on a workstation cluster.
Keywords: interactive visualization,
multi-resolution visualization, adaptive visualization, scientific
application, parallel octrees, pario-bib
Comment: They describe a technique that combines
hierarchical data reduction methods with parallel computing to allow
"interactive exploration of large data sets while retaining full-resolution
capabilities." They point out that visualization of large data sets requires
a post-processing step to reduce the size, or sophisticated rendering
algorithms that work with the full resolution. Their method combines the two
techniques.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Proposes the min_SAR, max_SAR, and
ratio phi as measures of aggregate file system bandwidth. Has to do with load
balance issues; how well the file system balances between competing nodes in
a heavy-use period.
Keywords: parallel I/O, Intel iPSC/2, pario-bib
Keywords: parallel I/O, Intel iPSC/2, pario-bib
Keywords: parallel I/O, Intel iPSC/2, disk
caching, prefetching, pario-bib
Comment: Nice study of performance of existing CFS
system on 32-node + 4 I/O-node iPSC/2. They show big improvements due to
declustering, preallocation, caching, and prefetching. See also pratt:twofs.
Keywords: parallel I/O, pario-bib
Comment: They give a useful overview of the I/O
requirements of many applications codes, in terms of input, output, scratch
files, debugging, and checkpointing. They also describe their
architecture-independent I/O interface that provides calls to read and write
entire arrays, with some flexibility in the format and distribution of the
array. Curious centralized control method. Limited performance evaluation.
They're trying to keep the I/O media, file layout, and I/O architecture
transparent to the user. Implementation decides which processors actually do
read/write. Data formatted or unformatted; file sequential or parallel; can
specify distributed arrays with ghost points. Runs on lots of platforms; will
also be implementing on IBM SP-1 with disk per node, 128 nodes. Their package
is freely available via ftp. Future: buffer-size experiments, unstructured
data, use parallel file internally and then sequentialize on close.
Keywords: parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of galbreath:applio.
Keywords: disk array, RAID, parallel I/O,
pario-bib, survey
Keywords: parallel I/O, disk striping, load
balancing, pario-bib
Comment: Using trace-driven simulation to compare
dynamic load-balancing techniques in databases that span several disk drives,
with the inherent load-balancing of striping. Their traces were from two
Oracle databases on two different NCR systems. They found that striping, with
its essentially random block-by-block load balancing, does a better job of
avoiding short-term load imbalances than the ``manual'' load-balancing does.
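For reference, the block-interleaved striping being compared here amounts to
the simple mapping below (my own minimal sketch, assuming a fixed block size
and round-robin placement):

```c
/* Round-robin, block-interleaved striping: consecutive file blocks rotate
 * over the disks, so load is balanced block by block rather than by table. */
typedef struct { int disk; long block_on_disk; } stripe_loc;

static stripe_loc locate_block(long file_block, int ndisks)
{
    stripe_loc loc = { (int)(file_block % ndisks), file_block / ndisks };
    return loc;
}
```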
Abstract: This article describes an implementation
of MPI-IO using a new parallel file system, called Expand (Expandable
Parallel File System), which is based on NFS servers. Expand combines
multiple NFS servers to create a distributed partition where files are
striped. Expand requires no changes to the NFS server and uses RPC operations
to provide parallel access to the same file. Expand is also independent of
the clients, because all operations are implemented using RPC and NFS
protocols. Using this system, we can join heterogeneous servers (Linux,
Solaris, Windows 2000, etc.) to provide a parallel and distributed partition.
The article describes the design, implementation and evaluation of Expand
with MPI-IO. This evaluation has been made in Linux clusters and compares
Expand and PVFS.
Keywords: parallel file system, parallel I/O,
pario-bib
Keywords: parallel I/O, disk striping,
reliability, disk array, pario-bib
Comment: Reliability of striped filesystems may
not be as bad as you think. Parity disks help. Performance improvements
limited to small number of disks ($n<10$). Good point: efficiency of striping
will increase as the gap between CPU/memory performance and disk speed and
file size widens. Reliability may be better if measured in terms of
performing a task in time T, since the striped version may take less time.
This gives disks less opportunity to fail during that period. Also consider
the CPU failure mode, and its use over less time.
Abstract: In recent years, many commercial
Massively Parallel Processor (MPP) systems have become available to the
computing community. These systems provide very high processing power (up to
hundreds of GFLOPs), and can scale efficiently with the number of processors.
However, many scientific and commercial applications that run on these
multiprocessors may not experience significant benefit in terms of speedup
and are bottlenecked by their I/O requirements. Although these
multiprocessors may be configured with sufficient I/O hardware, the file
system software often fails to provide the available I/O bandwidth to the
application, and causes severe performance degradation for I/O-intensive
applications. A highly efficient parallel file system has been
implemented on Intel's Teraflops (TFLOPS) machine and provides a sustained
I/O bandwidth of 1 GB/sec. This file system provides almost 95% of the
available raw hardware I/O bandwidth, and the I/O bandwidth scales in
proportion to the number of available I/O nodes. Intel's TFLOPS machine is the
first Accelerated Strategic Computing Initiative (ASCI) machine that DOE has
acquired. This computer is 10 times more powerful than the fastest machine
today, and will be used primarily to simulate nuclear testing and to ensure
the safety and effectiveness of the nation's nuclear weapons stockpile.
This machine contains over 9000 Intel Pentium Pro processors, and will
provide a peak CPU performance of 1.8 teraflops. This paper presents the I/O
design and architecture of Intel's TFLOPS supercomputer and describes the
Cougar OS I/O and its interface with Intel's Parallel File System.
Keywords: parallel file system, intel, ASCI Red,
pario-bib
Comment: Describes the parallel file system for
ASCI Red. The paper is only available as HTML
Abstract: Bulk Synchronous Parallel ML or BSML is
a functional data-parallel language for programming bulk synchronous parallel
(BSP) algorithms. The execution time can be estimated and deadlocks and
indeterminism are avoided. For large scale applications where parallel
processing is helpful and where the total amount of data often exceeds the
total main memory available, parallel disk I/O becomes a necessity. We
present here a library of I/O features for BSML and its cost model.
Keywords: parallel I/O, parallel ML, BSML, data
parallel language, pario-bib
Keywords: parallel I/O, multimedia, pario-bib
Comment: Part of jin:io-book; reformatted version
of gennart:comparing.
Abstract: Multimedia interfaces increase the need
for large image databases, capable of storing and reading streams of data
with strict synchronicity and isochronicity requirements. In order to fulfil
these requirements, we use a parallel image server architecture which relies
on arrays of intelligent disk nodes, each disk node being composed of one
processor and one or more disks. This contribution analyzes through
simulation the real-time behavior of two multiprocessor multi-disk
architectures: GigaView and the Unix workstation cluster. GigaView
incorporates point-to-point communication between processing units and the
workstation cluster supports communication through a shared bus-and-memory
architecture. For a standard multimedia server architecture consisting of 8
disks and 4 disk-node processors, we evaluate stream frame access times under
various parameters such as load factors, frame size, stream throughput and
synchronicity requirements. We compare the behavior of GigaView and the
workstation cluster in terms of delay and delay jitter.
Keywords: parallel I/O, multimedia, pario-bib
Keywords: parallel I/O, IBM SP2, pario-bib
Keywords: multimedia, parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of ghandeharizadeh:mitra.
Abstract: Mitra is a scalable storage manager that supports the display of
continuous media data types, e.g., audio and video clips. It is a
software-based system that employs off-the-shelf hardware components. Its
present hardware platform is a cluster of multi-disk workstations, connected
using an ATM switch. Mitra supports the display of a mix of media types. To
reduce the cost of storage, it supports a hierarchical organization of storage
devices and stages the frequently accessed objects on the magnetic disks. For
the number of displays to scale as a function of additional disks, Mitra
employs staggered striping. It implements three strategies to maximize the
number of simultaneous displays supported by each disk. First, the EVEREST
file system allows different files (corresponding to objects of different
media types) to be retrieved at different block size granularities. Second,
the FIXB algorithm recognizes the different zones of a disk and guarantees a
continuous display while harnessing the average disk transfer rate. Third,
Mitra implements the Grouped Sweeping Scheme (GSS) to minimize the impact of
disk seeks on the available disk bandwidth. In addition to reporting on
implementation details of Mitra, we present performance results that
demonstrate the scalability characteristics of the system. We compare the
obtained results with theoretical expectations based on the bandwidth of
participating disks. Mitra attains between 65% and 100% of the theoretical
expectations.
Keywords: multimedia, parallel I/O, pario-bib
Comment: This paper describes the continuous media
server Mitra. Mitra runs on a cluster of multi-disk HP 9000/735 workstations.
Each workstation has 80 Mbytes of memory and four disks. They
implement ``staggered striping'' of the data, in which disks are clustered
based on media type and treated as a single logical unit. Data is then
striped across the logical disk cluster in a round-robin fashion. They
present performance results as a function of total number of disks and the
number of disks in a cluster.
Keywords: parallel I/O, multimedia, pario-bib
Comment: Part of a special issue.
Abstract: We have designed and implemented the Google File System, a scalable
distributed file system for large distributed data-intensive applications. It
provides fault tolerance while running on inexpensive commodity hardware, and
it delivers high aggregate performance to a large number of clients. While
sharing many of the same goals as previous distributed file systems, our
design has been driven by observations of our application workloads and
technological environment, both current and anticipated, that reflect a marked
departure from some earlier file system assumptions. This has led us to
re-examine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed
within Google as the storage platform for the generation and processing of
data used by our service as well as research and development efforts that
require large data sets. The largest cluster to date provides hundreds of
terabytes of storage across thousands of disks on over a thousand machines,
and it is concurrently accessed by hundreds of clients. In this paper, we
present file system interface extensions designed to support distributed
applications, discuss many aspects of our design, and report measurements from
both micro-benchmarks and real world use.
Keywords: distributed file system, pario-bib
Keywords: parallel I/O, MIMD, multiprocessor
architecture, hypercube, pario-bib
Comment: Given a hypercube that has I/O nodes
scattered throughout, they compare a plain one to one that has the I/O nodes
also interconnected with a half-size hypercube. They show that this has
better performance because the I/O traffic does not interfere with normal
inter-PE traffic. See also ghosh:pario.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: They simulate a 128-node hypercube with
16 I/O nodes attached at uniformly distributed points. They compare two
architectures: one with a separate I/O network, and another without a
separate I/O network. When present, the extra network is used to route I/O
packets from the originating I/O node to the I/O node closest to the
destination processing node (or vice versa). They run simulations under
workloads with differing amounts of locality, and experiment with different
bandwidths for the links. They conclude that the extra network helps. But
they never make the (proper, fair) comparison where the total network
bandwidth is held constant. See also ghosh:hyper.
Keywords: parallel I/O, RAID, redundancy,
reliability, pario-bib
Keywords: parallel I/O, disk array, disk striping,
reliability, RAID, pario-bib
Comment: Excellent book. Good source for
discussion of the access gap and transfer gap, disk lifetimes, parity
methods, reliability analysis, and generally the case for RAIDs. Page 220 he
briefly discusses multiprocessor I/O architecture.
Keywords: network-attached storage, storage
architecture, parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of gibson:storage.
Abstract: The most difficult and frequently most
important challenge for high performance file access is the achievement of
low latency cache misses. We propose to explore the utility and feasibility
of using file access hints to schedule overlapped prefetching of file data.
Hints may be issued explicitly by programmers, automatically by compilers,
speculatively by parent tasks such as shells and makes, or historically by
previously profiled executions. Our research will also address the thorny
issues of hint specification, memory resource management, imprecise and
incorrect hints, and appropriate interfaces for propagating hints through and
to affected application, operating system, file system, and device specific
modules. We begin our research with a detailed examination of two
applications with large potential for improvement: compilation of multiple
module software systems and scientific simulation using very large grid state
files.
Keywords: file system, prefetching, pario-bib
Comment: A relatively early TIP report with
nothing really new over patterson:tip.
Abstract: The ever increasing need for I/O
bandwidth will be met with ever larger arrays of disks. These arrays require
redundancy to protect against data loss. This paper examines alternative
choices for encodings, or codes, that reliably store information in disk
arrays. Codes are selected to maximize mean time to data loss or minimize
disks containing redundant data, but are all constrained to minimize
performance penalties associated with updating information or recovering from
catastrophic disk failures. We also present codes that give highly reliable data
storage with low redundant data overhead for arrays of 1000 information
disks.
Keywords: parallel I/O, disk array, RAID,
reliability, pario-bib
Comment: See gibson:raid for comments since it is
the same.
Abstract: By providing direct data transfer
between storage and client, network-attached storage devices have the
potential to improve scalability for existing distributed file systems (by
removing the server as a bottleneck) and bandwidth for new parallel and
distributed file systems (through network striping and more efficient data
paths). Together, these advantages influence a large enough fraction of the
storage market to make commodity network-attached storage feasible. Realizing
the technology's full potential requires careful consideration across a wide
range of file system, networking and security issues. This paper contrasts
two network-attached storage architectures-(1) Networked SCSI disks (NetSCSI)
are network attached storage devices with minimal changes from the familiar
SCSI interface, while (2) Network-Attached Secure Disks (NASD) are drives
that support independent client access to drive object services. To estimate
the potential performance benefits of these architectures, we develop an
analytic model and perform trace-driven replay experiments based on AFS and
NFS traces. Our results suggest that NetSCSI can reduce file server load
during a burst of NFS or AFS activity by about 30%. With the NASD
architecture, server load (during burst activity) can be reduced by a factor
of up to five for AFS and up to ten for NFS.
Keywords: NASD, network-attached disks,
distributed file system, parallel file system, security, secure disks,
pario-bib
Comment: Essentially, the conference (and
subsequent journal) version of gibson:nasd-tr. The studies that use simple
analytical models (based on measured workloads of NFS and AFS file managers)
to compare performance of NASD to SAD (storage-attached disks) and NetSCSI
are often cited as justification for the NASD and object-based storage
approaches.
Abstract: By providing direct data transfer between storage and client,
network-attached storage devices have the potential to improve scalability (by
removing the server as a bottleneck) and performance (through network striping
and shorter data paths). Realizing the technology's full potential requires
careful consideration across a wide range of file system, networking and
security issues. To address these issues, this paper presents two new
network-attached storage architectures. (1) Networked SCSI disks (NetSCSI) are
network-attached storage devices with minimal changes from the familiar SCSI
interface, and (2) Network-attached secure disks (NASD) are drives that
support independent client access to drive-provided object services. For both
architectures, we present a sketch of repartitionings of distributed file
system functionality, including a security framework whose strongest levels
use tamper resistant processing in the disks to provide action authorization
and data privacy even when the drive is in a physically insecure location.
Using AFS and NFS, trace results suggest that NetSCSI can reduce file server
load during a burst of AFS activity by a factor of about 2; for the NASD
architecture, server load (during burst activity) can be reduced by a factor
of about 4 for AFS and 10 for NFS.
Keywords: parallel I/O, network attached storage,
distributed file systems, computer security, network attached secure disks,
NASD, capability system, pario-bib
Comment: They outline their rationale for the idea
of Network-attached Secure Disks (NASD). Basically the idea is to develop
disk drives that attach right to the LAN, rather than to a file server, and
allow clients to access the disks directly for many of the simpler file
system actions (read and write file data, read file attributes), and only
contact the server for more complex activities (opening and creating files,
changing attributes). This removes the load from file servers, which are
getting too slow to move large amounts of data needed by large installations.
Issues include security, of course, which they solve with encryption (for
privacy) and time-limited capabilities (keys) given out by the server to
authenticated clients, which the clients show to the disk to gain access.
They compare the performance of NASD, using a simple analytical model and
parameters obtained from measuring real NFS and AFS implementations, to the
performance of SAD (server-attached disks) and NetSCSI (a hybrid approach
that involves the server in every operation but allows data to flow directly
from disk to and from the network).
Keywords: parallel I/O, RAID, reliability, disk
array, pario-bib
Comment: Design of parity encodings to handle more
than one bit failure in any group. Their 2-bit correcting codes are good
enough for 1000-disk RAIDs that 3-bit correction is not needed.
Keywords: parallel I/O, RAID, disk array,
reliability, simulation, pario-bib
Comment: Short version appeared as
courtright:raidframe. Pretty neat idea. They provide a way to express the
sequence of disk-access operations in a RAID controller using directed
acyclic graphs, and a library that can `execute' these graphs either in a
simulation or in a software-RAID implementation. The big benefit is that it
is faster, easier, and less error-prone to implement various RAID management
policies.
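A minimal sketch of the DAG idea, assuming made-up node and executor types
rather than RAIDframe's real interface: each node is a disk or parity
operation, edges are dependencies, and the executor runs any node whose
predecessors have completed (here for a classic RAID-5 small write).

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    struct Node {
        std::string name;
        std::function<void()> action;     // e.g. read old data, write parity
        std::vector<int> deps;            // indices of predecessor nodes
        bool done = false;
    };

    // Execute the DAG in dependency order (a real engine would issue
    // independent nodes concurrently).
    void execute(std::vector<Node>& dag) {
        bool progress = true;
        while (progress) {
            progress = false;
            for (auto& n : dag) {
                if (n.done) continue;
                bool ready = true;
                for (int d : n.deps) ready = ready && dag[d].done;
                if (ready) { n.action(); n.done = true; progress = true; }
            }
        }
    }

    int main() {
        auto op = [](const char* what) {
            return [what] { std::printf("%s\n", what); };
        };
        // RAID-5 small write: read old data and old parity, compute new
        // parity, then write new data and new parity.
        std::vector<Node> dag = {
            {"read old data",    op("read old data"),    {}},
            {"read old parity",  op("read old parity"),  {}},
            {"compute parity",   op("xor old data, new data, old parity"), {0, 1}},
            {"write new data",   op("write new data"),   {0}},
            {"write new parity", op("write new parity"), {2}},
        };
        execute(dag);
    }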
Keywords: parallel I/O, RAID, disk array,
multiprocessor file system, file prefetching, file caching, cache
consistency, pario-bib
Keywords: parallel I/O, RAID, disk array,
multiprocessor file system, file prefetching, file caching, cache
consistency, pario-bib
Comment: An overview of research being done in
Garth's group. Touches on work in RAID disk arrays, parallel file systems,
and prefetching. I think gibson:scotch-tr is nearly the same.
Abstract: We discuss the strategic directions and
challenges in the management and use of storage systems-those components of
computer systems responsible for the storage and retrieval of data. The
performance gap between main and secondary memories shows no imminent sign of
vanishing, and thus continuing research into storage I/O will be essential to
reap the full benefit from the advances occurring in many other areas of
computer science. In this report we identify a few strategic research goals
and possible thrusts to meet those goals.
Keywords: supercomputing, data storage, database,
parallel I/O, pario-bib
Comment: A more reliable, but limited-access, URL
is http://www.acm.org/pubs/citations/journals/surveys/1996-28-4/p779-gibson/
Keywords: network-attached storage, storage
architecture, parallel I/O, pario-bib
Abstract: Storage systems are continuing to grow,
and they are becoming shared resources with the advent of I/O networks like
FibreChannel. Managing these resources to meet performance and resiliency
goals is becoming a significant challenge. We believe that completely
automatic, attribute-managed storage is the way to address this issue. Our
approach is based on declarative specifications of both application workloads
and device characteristics. These are combined by a matching engine to
generate a load-assignment that provides optimal performance and meets
availability guarantees, at minimum cost.
Keywords: I/O architecture, disk array, RAID, file
system, storage system, pario-bib
Comment: This is just a 4-page position paper. See
also shriver:slides.
Keywords: video server, multimedia, parallel I/O,
pario-bib
Comment: An approach called adaptive piggybacking
groups together streams that are watching the same video, but at slightly
different times, so that they can share the I/O streams.
Abstract: In recent years advances in
computational speed have been the main focus of research and development in
high performance computing. In comparison, the improvement in I/O performance
has been modest. Faster processing speeds have created a need for faster I/O
as well as for the storage and retrieval of vast amounts of data. The
technology needed to develop these mass storage systems exists today. Robotic
storage libraries are vital components of such systems. However, they
normally exhibit high latency and long transmission times. We analyze the
performance of robotic storage libraries and study striping as a technique
for improving response time. Although striping has been extensively studied
in the context of disk arrays, the architectural differences between robotic
storage libraries and arrays of disks suggest that a separate study of
striping techniques in such libraries would be beneficial.
Keywords: mass storage, parallel I/O, pario-bib
Keywords: parallel I/O, multimedia, survey,
pario-bib
Comment: Part of a special issue.
Abstract: In this paper, we give new techniques for designing efficient
algorithms for computational geometry problems that are too large to be solved
in internal memory, and we use these techniques to develop optimal and
practical algorithms for a number of important large-scale problems in
computational geometry. Our algorithms are optimal for a wide range of
two-level and hierarchical multilevel memory models, including parallel
models. The algorithms are optimal in terms of both I/O cost and internal
computation. Our results are built on four fundamental techniques:
distribution sweeping, a generic method for externalizing plane-sweep
algorithms; persistent B-trees, for which we have both on-line and off-line
methods; batch filtering, a general method for performing $K$ simultaneous
external-memory searches in any data structure that can be modeled as a planar
layered dag; and external marriage-before-conquest, an external-memory analog
of the well-known technique of Kirkpatrick and Seidel. Using these techniques
we are able to solve a very large number of problems in computational
geometry, including batched range queries, 2-d and 3-d convex hull
construction, planar point location, range queries, finding all nearest
neighbors for a set of planar points, rectangle intersection/union reporting,
computing the visibility of segments from a point, performing ray-shooting
queries in constructive solid geometry (CSG) models, as well as several
geometric dominance problems. These results are significant because
large-scale problems involving geometric data are ubiquitous in spatial
databases, geographic information systems (GIS), constraint logic programming,
object oriented databases, statistics, virtual reality systems, and graphics.
This work makes a big step, both theoretically and in practice, towards the
effective management and manipulation of geometric data in external memory,
which is an essential component of these applications.
Keywords: computational geometry, parallel I/O
algorithm, pario-bib
Keywords: parallel I/O, RAID, disk array,
pario-bib
Keywords: parallel I/O, pario-bib
Keywords: parallel I/O, object-oriented,
distributed data structures, runtime library, pario-bib
Comment: URL is for tech report version. They have
a language called pC++ that allows object-parallel programming. They have a
library called d/streams for I/O of distributed arrays. pC++/streams is a
combination. You open a file, specify the in-memory distribution, read from
the stream, and then extract some variables. Likewise, you insert some
variables (into the stream buffer), then write it. They manage the
distribution, and they store necessary metadata to reassemble the data
structure when reading. Variables can be arbitrary classes, with $>>$ and
$<<$ overloaded as the insert and extract operators. Performance is
reasonable on Intel Paragon and SGI Challenge.
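A rough sketch of what that stream-style usage could look like in C++; the
class and method names here are invented for illustration and are not the
actual pC++/streams (d/streams) API.

    #include <string>
    #include <vector>

    struct Distribution { int nprocs; std::string layout; };  // e.g. {16, "BLOCK"}

    class DStream {
    public:
        DStream(const std::string& path, const char* mode) { /* open file(s) */ }
        void set_distribution(const Distribution& d) { dist_ = d; }
        // Extraction: read the next object and scatter it per the distribution.
        DStream& operator>>(std::vector<double>& array) {
            /* read from the stream and scatter */ return *this;
        }
        // Insertion: gather the distributed object into the stream buffer.
        DStream& operator<<(const std::vector<double>& array) {
            /* gather and buffer, with metadata for later reassembly */ return *this;
        }
        void write() { /* flush the buffered objects to the file */ }
    private:
        Distribution dist_{1, "BLOCK"};
    };

    int main() {
        Distribution d{16, "BLOCK"};
        std::vector<double> temperature(1 << 20);

        DStream in("state.in", "r");
        in.set_distribution(d);      // describe the in-memory distribution
        in >> temperature;           // extract a distributed array

        DStream out("state.out", "w");
        out.set_distribution(d);
        out << temperature;          // insert the array into the stream buffer
        out.write();                 // write the stream
    }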
Abstract: Technology trends promise to give us
processors with pico-second clock speeds. These pico-processors will spend
much of their time waiting for information from the storage hierarchy. I
believe this will force us to adopt a data-flow programming model. Similar
trends will bless us with peta-byte online stores with exa-byte near-line
stores. One large disk manufacturer claims it costs $8/year to manage a
megabyte of online storage. That is 8 Billion dollars per year to manage a
petabyte. Automating storage management is one of our major challenges. This
talk covers these technology trends, surveys the current status of commercial
software tools (aka database systems), their peak performance and price
performance. It then poses four major challenges: total-cost-of ownership,
long-term archiving, reliably storing exabytes, and data mining on petabyte
databases.
Keywords: parallel computing, computer
architecture, parallel I/O, pario-bib, memory hierarchy, distributed
computing, database, object oriented
Comment: Very interesting talk. URL points to
PowerPoint slides.
Keywords: disk striping, reliability, pario-bib
Comment: Parity striping, a variation of RAID 5,
is just a different way of mapping blocks to disks. It groups parity blocks
into extents, and does not stripe the data blocks. A logical disk is mostly
contained in one physical disk, plus a parity region in another disk. Good
for transaction processing workloads. Has the low cost/GByte of RAID, the
reliability of RAID, without the high transfer rate of RAID, but with much
better requests/second throughput than RAID 5. (But 40% worse than mirrors.)
So it is a compromise between RAID and mirrors. BUT, see mourad:raid.
Keywords: parallel I/O, parallel file system,
object-oriented, file system interface, Intel iPSC/2, pario-bib
Comment: See also grimshaw:elfs. They hope to
provide high bandwidth and low latency, reduce the cognitive burden on the
programmer, and manage proliferation of data formats and architectural
changes. Details of the plan to make an extensible OO interface to file
system. Objects each have a separate thread of control, so they can do
asynchronous activity like prefetching and caching in the background, and
support multiple outstanding requests. The Mentat object system makes it easy
for them to support pipelining of I/O with I/O and computation in the user
program. Let the user choose type of consistency needed. See grimshaw:objects
for more results.
Keywords: parallel I/O, parallel file system,
object-oriented, file system interface, pario-bib
Comment: Full paper grimshaw:ELFSTR. Really neat
idea. Uses OO interface to file system, which is mostly in user mode. The
object classes represent particular access patterns (e.g., 2-D matrix) in the
file, and hide the actual structure of the file. The object knows enough to
tailor the cache and prefetch algorithms to the semantics. Class inheritance
allows layering.
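A minimal sketch of the idea, with class names invented for illustration
rather than taken from ELFS: a base file class hides the storage layout, and a
subclass that knows the data is a 2-D matrix can tailor its prefetching to
row-by-row access.

    #include <cstddef>
    #include <vector>

    class ParallelFile {
    public:
        virtual ~ParallelFile() = default;
        virtual void read(std::size_t offset, void* buf, std::size_t len) = 0;
    };

    class Matrix2DFile : public ParallelFile {
    public:
        Matrix2DFile(std::size_t rows, std::size_t cols)
            : rows_(rows), cols_(cols) {}

        // The caller asks for a row; the object knows the file layout and can
        // prefetch the next row in the background while this one is used.
        std::vector<double> read_row(std::size_t r) {
            std::vector<double> row(cols_);
            read(r * cols_ * sizeof(double), row.data(), cols_ * sizeof(double));
            prefetch_row(r + 1);
            return row;
        }

        void read(std::size_t, void*, std::size_t) override { /* fetch from storage */ }

    private:
        void prefetch_row(std::size_t r) { if (r < rows_) { /* issue async read */ } }
        std::size_t rows_, cols_;
    };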
Keywords: parallel I/O, multiprocessor file
system, file system interface, pario-bib
Comment: Not much new from ELFS TR. A better
citation than grimshaw:ELFS though. Does give CFS performance results. Note
on page 721 he says that CFS prefetches into ``local memory from which to satisfy
future user requests that never come.'' This happens if the local
access pattern isn't purely sequential, as in an interleaved pattern.
Abstract: Highly parallel applications often use
either highly parallel file systems or large numbers of independent disks.
Either approach can provide the high data rates necessary for parallel
applications. However, the failure of a single disk or server can render the
data useless. Conventional techniques, such as those based on applying
erasure correcting codes to each file write, are prohibitively expensive for
massively parallel scientific applications because of the granularity of
access at which the codes are applied. In this paper we demonstrate a
scalable method for recovering from single disk failures that is optimized
for typical scientific data sets. This approach exploits coarser-grained (but
precise) semantics to reduce the overhead of constructing recovery data and
makes use of parallel computation (proportional to the data size and
independent of number of processors) to construct data. Experiments are
presented showing the efficiency of this approach on a cluster with
independent disks, and a technique is described for hiding the creation of
redundant data within the MPI-IO implementation.
Keywords: fault-tolerance, single-disk failures,
MPI-IO, pario-bib
Abstract: The Message Passing Interface (MPI) specification is widely used for
solving significant scientific and engineering problems on parallel computers.
There exist more than a dozen implementations on computer platforms ranging
from IBM SP-2 supercomputers to clusters of PCs running Windows NT or Linux
("Beowulf" machines). The initial MPI Standard document, MPI-1, was recently
updated by the MPI Forum. The new version, MPI-2, contains both significant
enhancements to the existing MPI core and new features. Using MPI is a
completely up-to-date version of the authors' 1994 introduction to the core
functions of MPI. It adds material on the new C++ and Fortran 90 bindings for
MPI throughout the book. It contains greater discussion of datatype extents,
the most frequently misunderstood feature of MPI-1, as well as material on the
new extensions to basic MPI functionality added by the MPI-2 Forum in the area
of MPI datatypes and collective operations. Using MPI-2 covers the new
extensions to basic MPI. These include parallel I/O, remote memory access
operations, and dynamic process management. The volume also includes material
on tuning MPI applications for high performance on modern MPI implementations.
Keywords: parallel computing, message passing,
parallel I/O, multiprocessor file system interface, pario-bib
Comment: Has a large chapter on MPI-IO with lots
of example programs.
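For flavor, a minimal MPI-IO program in the spirit of that chapter (written
here as an illustration, not copied from the book): each rank writes its own
contiguous block of a shared file with a collective call.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;                      // elements per rank
        std::vector<int> buf(n, rank);           // each rank writes its rank id

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(int);
        MPI_File_write_at_all(fh, offset, buf.data(), n, MPI_INT,
                              MPI_STATUS_IGNORE);   // collective write

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }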
Keywords: parallel I/O, parallel architecture,
networking, pario-bib
Comment: They examine the characteristics of a
system that has I/O nodes which interface between the internal
interconnection network of a distributed-memory MIMD machine and some
external network, such as HIPPI. They build a simple model to show how
different components affect the I/O throughput. They show the performance of
their iWarp-HIPPI interface. They conclude that the I/O nodes must have
sufficient memory bandwidth to support multiple data streams coming from
several compute nodes, being combined into a single faster external network,
or vice versa. They need to support scatter/gather, because the data is often
distributed in small pieces. For the same reason, they need to have low
per-message overhead. The internal network routing must allow multiple paths
between compute nodes and the I/O nodes, to avoid congestion.
Keywords: out-of-core algorithm, data structure,
pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Keywords: mass storage, parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of grossman:library.
Abstract: Requirements for a high-performance,
scalable digital library of multimedia data are presented together with a
layered architecture for a system that addresses the requirements. The
approach is to view digital data as persistent collections of complex objects
and to use lightweight object management to manage this data. To scale as the
amount of data increases, the object management component is layered over a
storage management component. The storage management component supports
hierarchical storage, third-party data transfer and parallel input-output.
Several issues that arise from the interface between the storage management
and object management components are discussed. The authors have developed a
prototype of a digital library using this design. Two key components of the
prototype are AIM Net and HPSS. AIM Net is a persistent object manager and is
a product of Oak Park Research. HPSS is the High Performance Storage System,
developed by a collaboration including IBM Government Systems and several
national labs.
Keywords: mass storage, parallel I/O, pario-bib
Abstract: This paper presents a framework for
synthesizing efficient out-of-core programs for block recursive algorithms
such as the fast Fourier transform (FFT) and Batcher's bitonic sort. The
block recursive algorithms conside red in this paper are described using
tensor (Kronecker) product and other matrix operations. The algebraic
properties of the matrix representation are used to derive efficient
out-of-core programs. These programs are targeted towards a two-level disk
model which allows HPF supported cyclic(B) data distribution on a disk array.
The effectiveness of our approach is demonstrated through an example
out-of-core FFT program implemented on a work-station.
Keywords: parallel I/O algorithm, pario-bib
Keywords: parallel computing, scientific
computing, weather prediction, global climate model, parallel I/O, pario-bib
Comment: There is some discussion of I/O issues.
This weather code does some out-of-core work, to communicate data between
time steps. They also dump a 'history' file every simulated day, and periodic
checkpoint files. They are flexible about the layout of the history file,
assuming postprocessing will clean it up. The I/O is not too much trouble on
the Cray C90, where they get 350 MBps to the SSD for the out-of-core data.
The history I/O is no problem. On distributed-memory machines with no SSD,
out-of-core was impractical and the history file was only written once per
simulated month. 'The most significant weakness in the distributed-memory
implementation is the treatment of I/O, [due to] file system maturity....'
See hammond:atmosphere and jones:skyhi in the same issue.
Keywords: network congestion, parallel tcp
streams, transport protocols, pario-bib
Keywords: network congestion, parallel tcp
streams, fairness, transport protocols, pario-bib
Comment: Also see earlier hacker:parallel-tcp and
hacker:effects
Abstract: This paper examines the effects of using
parallel TCP flows to improve end-to-end network performance for distributed
data intensive applications. A series of transmission experiments were
conducted over a wide-area network to assess how parallel flows improve
throughput, and to understand the number of flows necessary to improve
throughput while avoiding congestion. An empirical throughput expression for
parallel flows based on experimental data is presented, and guidelines for
the use of parallel flows are discussed. (45 refs.)
Keywords: network congestion, parallel tcp
streams, transport protocols, pario-bib
Keywords: hypercube, multiprocessor file system,
pario-bib
Comment: An early paper describing a proposed file
system for hypercubes. The writing is almost impenetrable. Confusing and not
at all clear what they propose. See also hadimioglu:hyperfs and
flynn:hyper-fs.
Keywords: multiprocessor file system, parallel
I/O, hypercube, pario-bib
Comment: Describes a hypercube file system based
on I/O nodes and processor nodes. A few results from a hypercube simulator.
See hadimioglu:fs and flynn:hyper-fs.
Keywords: parallel computing, scientific
computing, weather prediction, global climate model, parallel I/O, pario-bib
Comment: They discuss a weather code that runs on
the CM-5. The code writes a history file, dumping some data every timestep,
and periodically a restart file. They found that CM-5 Fortran met their
needs, although required huge buffers to get much scalability. They want to
see a single, shared file-system image from all processors, have the file
format be independent of processor count, use portable conventional
interface, and have throughput scale with the number of computation
processors. See also hack:ncar and jones:skyhi in the same issue.
Keywords: parallel I/O, parallel file system,
heterogeneous, pario-bib
Comment: See delrosario:vipfs-tr for an earlier
version. Also appears as NPAC report SCCS-686.
Abstract: In order to attain portability when
using message passing on a distributed memory system, a portable message
passing system must be used as well as other portable system support
services. MPI[1] addresses the message passing problem. To date, there are no
standards for system services and I/O. A library developed at NOAA's Forecast
Systems Laboratory (FSL) known as the Nearest Neighbor Tool[2] (NNT) provides
a high level portable interface to interprocess communications for finite
difference approximation numerical weather prediction (NWP) models. In order
to achieve portability, MPI is used to support interprocess communications.
The other services are provided by the lower level library developed at
NOAA/FSL known as the Scalable Runtime System (SRS). The principal focus of
this paper is SRS.
Keywords: parallel I/O, runtime library, pario-bib
Comment: They describe the runtime system that
supports the Nearest-Neighbor Tool (NNT), which they use to parallelize
weather-prediction codes. This paper gives a vague overview of the I/O
support. The interface sounds fairly typical, as does the underlying
structure (server processes, cache processes, etc). Sounds like it is in its
early stages, but is useful for many applications.
Keywords: parallel I/O, distributed file system,
disk striping, pario-bib
Comment: Part of jin:io-book; reformatted version
of hartman:zebra3.
Keywords: disk striping, distributed file system,
pario-bib
Comment: Not a parallel file system, but worth
comparing to Swift. Certainly, a similar idea could be used in a
multiprocessor. Cite hartman:zebra3.
Keywords: file system, disk striping, distributed
file system, RAID, log-structured file system, parallel I/O, pario-bib
Comment: Zebra stripes across network servers, but
not on a file-by-file basis. Instead they use LFS ideas to stripe a
per-client log across all file servers. Each client can then compute a parity
block for each strip that it writes. They store ``deltas'', changes in block
locations, in with the data, and also send them to the (central) file
manager. The file manager and stripe cleaner are key state managers, that
keep track of where blocks are located, and of stripe utilizations.
Performance numbers limited to small-scale tests. This paper has more details
than hartman:zebra, and performance numbers (but not with real workloads or
stripe cleaner). Some tricky consistency issues.
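A small sketch of the parity computation such a per-client log makes easy (my
illustration of the idea, not Zebra's code): because one client writes an
entire stripe of its own log, it can XOR the data fragments locally and send
the parity fragment to one more server, avoiding a read-modify-write of old
parity.

    #include <cstdint>
    #include <vector>

    using Fragment = std::vector<uint8_t>;

    // XOR all data fragments of a stripe into a single parity fragment.
    Fragment parity_of(const std::vector<Fragment>& data_fragments) {
        Fragment parity(data_fragments.front().size(), 0);
        for (const Fragment& f : data_fragments)
            for (std::size_t i = 0; i < parity.size(); ++i)
                parity[i] ^= f[i];
        return parity;
    }

    int main() {
        // Fragments destined for servers 0..N-1; the parity goes to server N.
        std::vector<Fragment> frags = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        Fragment p = parity_of(frags);   // any one lost fragment can be
        return p.size() == 3 ? 0 : 1;    // rebuilt by XOR-ing the others
    }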
Keywords: parallel I/O, distributed file system,
disk striping, pario-bib
Keywords: parallel I/O, Linda, data parallel,
nCUBE, parallel graphics, heterogeneous computing, pario-bib
Comment: C*-Linda is basically a combination of C*
and C-Linda. The model is that of several SIMD modules interacting in a MIMD
fashion through a Linda tuple space. The modules are created using
eval, as in Linda. In this case, the compiler statically assigns each eval
to a separate subcube on an nCUBE 3200, although they also talk about
multiprogramming several modules on a subcube (not supported by VERTEX). They
envision having separate modules running on the nCUBE's graphics processors,
or having the file system directly talk to the tuple space, to support I/O.
They also envision talking to modules elsewhere on a network, e.g., a
workstation, through the tuple space. They reject the idea of sharing memory
between modules due to the lack of synchrony between modules, and message
passing because it is error-prone.
Keywords: hypercube, parallel architecture, nCUBE,
pario-bib
Comment: Description of the first nCUBE, the
NCUBE/ten. Good historical background about hypercubes. Talks about their
design choices. Says a little about the file system - basically just a way
of mounting disks on top of each other, within the nCUBE and to other nCUBEs.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: An overview of the issues in designing a
parallel file system, along with some early ideas for their own file system.
They aim for a general-purpose system, and characterize the workload into
three classes: independent, much like a timesharing system;
cooperative-agents, like that expected by most current MIMD file systems; and
single-agent, for data-parallel programs where a ``master'' process issues
single large requests on behalf of many processes. Their design is heavily
weighted to the assumption of shared memory, and in particular to a
randomized shared memory (like RP3), so they don't worry about locality much.
They say little about their interface, although they intend to stick to a
Unix interface - and Unix semantics - as much as possible. The file
system is essentially represented by a collection of shared data structures
and many threads to manipulate those structures.
Abstract: Distributed-memory systems have
traditionally had great difficulty performing network I/O at rates
proportional to their computational power. The problem is that the network
interface has to support network I/O for a supercomputer, using computational
and memory bandwidth resources similar to those of a workstation. As a
result, the network interface becomes a bottleneck. We implemented an
architecture for network I/O for the iWarp system with the following two key
characteristics: first, application-specific tasks are off-loaded from the
network interface to the distributed-memory system, and second, these tasks
are performed in close cooperation with the application. The network
interface has been used by several applications for over a year. In this
paper we describe the network interface software that manages the
communication between the iWarp distributed-memory system and the network
interface, we validate the main features of our network interface
architecture based on application experience, and we discuss how this
architecture can be used by other distributed-memory systems.
Keywords: parallel network I/O, pario-bib
Comment: Parallel network I/O on the iWARP. Note
proceedings only on CD-ROM and WWW.
Keywords: parallel I/O, pario-bib
Comment: Scalable I/O initiative intends to build
a testbed. At Argonne, they have a 128-node SP-1 with a high-speed switch. 96
are compute nodes, 32 are I/O nodes (128 MB RAM, 1 GB local disk,
FibreChannel port). FibreChannel connects to RS/6000 which has 256 MB RAM,
two 80 MB/s busses, and a HIPPI interface to a 220 GB RAID (level 1 or 5) and
6.4 TB tape robot. They run UniTree on all this. They use multiple files to
get parallelism. FibreChannel with TCP/IP is the limiting factor. Note that
they are focusing more on the external connectivity issues than on the
internal file system.
Keywords: parallel I/O, disk media, optical disk,
holographic storage, trends, tape storage, parallel transfer disk, disk
striping, pario-bib
Comment: A good overview of the current
state of the art in March 1991, including particular numbers and vendor
names. They discuss disk media (density, rotation, etc.), parallel transfer
disks, disk arrays, parity and RAID, HiPPI, tape archives, optical memory,
and holographic storage. Rotation speeds can increase as diameter goes down.
Density increases are often offset by slower head settling times. Disk arrays
will hit their ``heyday'' in the 1990s. Trend toward network-attached storage
devices, that don't need a computer as a server.
Keywords: parallel I/O, virtual memory, SIMD,
multiprocessor file system, pario-bib
Comment: He has an MPL (Maspar C) preprocessor
that inserts code to allow you to make plural vectors and arrays pageable.
The preprocessor inserts checks before every access to see whether you have
that data in memory, and if not, to page it in. The preprocessor is supported
by a run-time library. No compiler, OS, or hardware mods.
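A rough illustration of the check-before-access transformation described
above, using a toy pageable array with one resident page; the types and paging
policy are invented for this sketch and do not reflect the actual preprocessor
or run-time library.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t PAGE = 256;                  // elements per page

    // Toy pageable array: backing store on "disk", one page resident in memory.
    struct PageableArray {
        std::vector<double> disk;                      // backing store
        std::vector<double> page = std::vector<double>(PAGE);
        std::size_t resident = SIZE_MAX;               // index of the resident page

        explicit PageableArray(std::size_t n) : disk(n, 0.0) {}

        // The inserted residency check: fault the needed page in before access.
        double& at(std::size_t i) {
            std::size_t p = i / PAGE;
            if (p != resident) {
                if (resident != SIZE_MAX)              // write back the old page
                    std::copy(page.begin(), page.end(),
                              disk.begin() + resident * PAGE);
                std::copy(disk.begin() + p * PAGE,     // page the new one in
                          disk.begin() + (p + 1) * PAGE, page.begin());
                resident = p;
            }
            return page[i - p * PAGE];
        }
    };

    int main() {
        const std::size_t n = 1 << 16;
        PageableArray a(n);
        // The programmer writes a[i] += 1.0; the preprocessor turns each
        // access into a checked access like a.at(i) += 1.0.
        for (std::size_t i = 0; i < n; ++i) a.at(i) += 1.0;
    }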
Keywords: parallel I/O, file system, pario-bib
Comment: Ways to arrange 2-d images on disk arrays
that have multiple processors (like Datamesh), so that retrieval time for
images or subimages is minimized.
Keywords: parallel I/O benchmarks, MPI-IO,
pario-app, pario-bib
Abstract: In this paper, we present a mechanism able to predict the
performance a given workload will achieve when running on a given storage
device. This mechanism is composed of two modules. The first one models the
workload and is able to reproduce its behavior later on, without a new
execution, even when the storage drives or data placement are modified. The
second module is a drive modeler that is able to learn how a storage drive
works in an automatic way, just by executing some synthetic tests. Once we
have the workload and drive models, we can predict how well that application
will perform on the selected storage device or devices, or when the data
placement is modified. The results presented in this paper show that this
prediction system achieves errors below 10% when compared to the real
performance obtained. It is important to notice that the two modules treat
both the application and the storage device as black boxes and need no
previous information about them. (20 refs.)
Keywords: performance prediction, data placement,
storage device modeling, parallel I/O, pario-bib
Comment: Could not find a URL. See
Keywords: parallel architecture, SIMD, MIMD,
parallel I/O, pario-bib
Comment: A good basic citation for the CM-5
architecture. A little bit about I/O.
Keywords: parallel database, concurrency control,
deadlock, parallel I/O, pario-bib
Comment: Most interesting to me in this paper is
their discussion of the ``container model,'' in which they claim they allow
the processors to be driven by the I/O devices. What it boils down to is a
producer-consumer queue of containers, each of which contains a task (some
tuples and presumably some instruction about what to do with them). The disks
put data into containers and stick them on the queue; the processors
repeatedly pull containers (tasks) from the queue and process them. They
don't describe the activity of the disks in much detail. See kitsuregawa:sdc.
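A small sketch of the container model as a producer-consumer queue in C++ (my
rendering of the concept, not the paper's code): a disk thread pushes
containers of tuples onto a shared queue and processor threads pull whatever
is ready, so the CPUs are effectively driven by the I/O devices.

    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Container { std::vector<int> tuples; };   // tuples plus an implied task

    class ContainerQueue {
    public:
        void push(Container c) {
            { std::lock_guard<std::mutex> g(m_); q_.push(std::move(c)); }
            cv_.notify_one();
        }
        void close() {                               // no more containers coming
            { std::lock_guard<std::mutex> g(m_); closed_ = true; }
            cv_.notify_all();
        }
        std::optional<Container> pop() {             // blocks until work or shutdown
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;
            Container c = std::move(q_.front()); q_.pop();
            return c;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Container> q_;
        bool closed_ = false;
    };

    int main() {
        ContainerQueue q;
        std::thread disk([&] {                       // producer: the I/O device
            for (int i = 0; i < 8; ++i) q.push(Container{{i, i + 1, i + 2}});
            q.close();
        });
        std::vector<std::thread> cpus;
        for (int p = 0; p < 2; ++p)
            cpus.emplace_back([&] {                  // consumers: the processors
                while (auto c = q.pop()) { /* join/aggregate c->tuples */ }
            });
        disk.join();
        for (auto& t : cpus) t.join();
    }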
Abstract: Recently, a new server-less architecture has been proposed for
building low-cost yet scalable video streaming systems. Compared to
conventional client-server-based video streaming systems, this server-less
architecture does not need any dedicated video server and yet is
highly scalable. Video data are distributed among user hosts and these hosts
cooperate to stream video data to one another. Thus as new hosts join the
system, they also add streaming and storage capacity to absorb the added
streaming load. This study investigates the data reorganization problem when
growing a server-less video streaming system. Specifically, as video data are
distributed among user hosts, these data will need to be redistributed to
newly joined hosts to utilize their storage and streaming capacity. This
study presents a new data reorganization algorithm that allows controllable
tradeoff between data reorganization overhead and streaming load balance.
Keywords: data reorganization, video on demand,
video streaming, pario-bib
Abstract: We describe and evaluate a strategy for declustering the parity
encoding in a redundant disk array. This declustered parity organization
balances cost against data reliability and performance during failure
recovery. It is targeted at highly-available parity-based arrays for use in
continuous-operation systems. It improves on standard parity organizations by
reducing the additional load on surviving disks during the reconstruction of a
failed disk's contents. This yields higher user throughput during recovery,
and/or shorter recovery time. We first address the generalized parity layout
problem, basing our solution on balanced incomplete and complete block
designs. A software implementation of declustering is then evaluated using a
disk array simulator under a highly concurrent workload comprised of small
user accesses. We show that declustered parity penalizes user response time
while a disk is being repaired (before and during its recovery) less than
comparable non-declustered (RAID5) organizations without any penalty to user
response time in the fault-free state. We then show that previously proposed
modifications to a simple, single-sweep reconstruction algorithm further
decrease user response times during recovery, but, contrary to previous
suggestions, the inclusion of these modifications may, for many
configurations, also slow the reconstruction process. This result arises from
the simple model of disk access performance used in previous work, which did
not consider throughput variations due to positioning delays.
Keywords: parity, declustering, striping, disk
array, redundancy, reliability, pario-bib
Abstract: The performance of traditional RAID Level 5 arrays is, for many
applications, unacceptably poor while one of its constituent disks is
non-functional. This paper describes and evaluates mechanisms by which this
disk array failure-recovery performance can be improved. The two key issues
addressed are the data layout, the mapping by which data and parity blocks are
assigned to physical disk blocks in an array, and the reconstruction
algorithm, which is the technique used to recover data that is lost when a
component disk fails. The data layout techniques this paper investigates are
instantiations of the declustered parity organization, a derivative of RAID
Level 5 that allows a system to trade some of its data capacity for improved
failure-recovery performance. We show that our instantiations of parity
declustering improve the failure-mode performance of an array significantly,
and that a parity-declustered architecture is preferable to an
equivalent-size multiple-group RAID Level 5 organization in environments where
failure-recovery performance is important. The presented analyses also include
comparisons to a RAID Level 1 (mirrored disks) approach. With respect to
reconstruction algorithms, this paper describes and briefly evaluates two
alternatives, stripe-oriented reconstruction and disk-oriented reconstruction,
and establishes that the latter is preferable as it provides faster
reconstruction. The paper then revisits a set of previously-proposed
reconstruction optimizations, evaluating their efficacy when used in
conjunction with the disk-oriented algorithm. The paper concludes with a
section on the reliability versus capacity trade-off that must be addressed
when designing large arrays.
Keywords: parallel I/O, disk array, RAID,
redundancy, reliability, pario-bib
Abstract: This paper describes and evaluates two algorithms for performing
on-line failure recovery (data reconstruction) in redundant disk arrays. It
presents an implementation of disk-oriented reconstruction, a data recovery
algorithm that allows the reconstruction process to absorb essentially all the
disk bandwidth not consumed by the user processes, and then compares this
algorithm to a previously proposed parallel stripe-oriented approach. The
disk-oriented approach yields better overall failure-recovery performance. The
paper evaluates performance via detailed simulation on two different disk
array architectures: the RAID level 5 organization, and the declustered parity
organization. The benefits of the disk-oriented algorithm can be achieved
using controller or host buffer memory no larger than the size of three disk
tracks per disk in the array. This paper also investigates the tradeoffs
involved in selecting the size of the disk accesses used by the failure
recovery process.
Keywords: parallel I/O, disk array, RAID,
redundancy, reliability, pario-bib
Abstract: There exists a wide variety of applications in which data
availability must be continuous, that is, where the system is never taken
off-line and any interruption in the accessibility of stored data causes
significant disruption in the service provided by the application. Examples
include on-line transaction processing systems such as airline reservation
systems, and automated teller networks in banking systems. In addition, there
exist many applications for which a high degree of data availability is
important, but continuous operation is not required. An example is a research
and development environment, where access to a centrally-stored CAD system is
often necessary to make progress on a design project. These applications and
many others mandate both high performance and high availability from their
storage subsystems. Parity-based redundant disk arrays are very attractive
storage alternatives for these systems because they offer both low cost per
megabyte and high data reliability. Unfortunately such systems exhibit poor
availability characteristics; their performance is severely degraded in the
presence of a disk failure. This dissertation addresses the design of
parity-based redundant disk arrays that offer dramatically higher levels of
performance in the presence of failure than systems comprising the current
state of the art. We consider two primary aspects of the failure-recovery
problem: the organization of the data and redundancy in the array, and the
algorithm used to recover the lost data. We apply results from combinatorial
theory to generate data and parity organizations that minimize performance
degradation during failure recovery by evenly distributing all failure-induced
workload over a larger-than-minimal collection of disks. We develop a
reconstruction algorithm that is able to absorb for failure-recovery
essentially all of the array's bandwidth that is not absorbed by the
application process(es). Additionally, we develop a design for a redundant
disk array targeted at extremely high availability through extremely fast
failure recovery. This development also demonstrates the generality of the
presented techniques.
Keywords: parallel I/O, disk arrays, RAID,
redundancy, reliability, pario-bib
Comment: Garth Gibson, advisor.
Keywords: parallel I/O, pario-bib
Comment: A short summary of disk I/O issues: disk
technology, latency reduction, parallel I/O, etc..
Keywords: disk array, reliability, parallel I/O,
pario-bib
Comment: Chained declustering has cost like
mirroring, since it replicates each block, but suffers a smaller load increase
during failure than mirrors, interleaved declustering, or RAID. (Or parity
striping (my guess)). Has reliability between that of mirrors and RAID, and
much better than interleaved declustering. Would also be much easier in a
distributed environment. See hsiao:diskrep.
Keywords: disk array, reliability, disk mirroring,
parallel I/O, pario-bib
Comment: Compares mirrored disks (MD) with
interleaved declustering (ID) with chained declustering (CD). ID and CD found
to have much better performance in normal and failure modes. See
hsiao:decluster.
Keywords: disk array, reliability, disk mirroring,
parallel I/O, pario-bib
Comment: See hsiao:diskrep.
Keywords: multimedia server, video on demand,
pario-bib
Keywords: parallel I/O, disk cache, disk striping,
disk array, pario-bib
Comment: Part of jin:io-book; reformatted version
of hu:rapid-cache.
Abstract: This paper presents a new cache
architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and
Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write
buffers on top of a disk system. One of the buffers is a primary cache made
of RAM or NVRAM and the other is a backup cache containing a two level
hierarchy: a small NVRAM buffer on top of a log disk. The backup cache has
nearly equivalent write performance as the primary RAM cache, while the read
performance of the backup cache is not as critical because normal read
operations are performed through the primary RAM cache and reads from the
backup cache happen only during error recovery periods. The RAPID-Cache
presents an asymmetric architecture with a fast-write-fast-read RAM being a
primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup
cache. The asymmetric cache architecture allows cost-effective designs for
very large write caches for high-end disk I/O systems that would otherwise
have to use dual-copy, costly NVRAM caches. It also makes it possible to
implement reliable write caching for low-end disk I/O systems since the
RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our
analysis and trace-driven simulation results show that the RAPID-Cache has
significant reliability/cost advantages over conventional single NVRAM write
caches and has great cost advantages over dual-copy NVRAM caches. The
RAPID-Cache architecture opens a new dimension for disk system designers to
exercise trade-offs among performance, reliability and cost.
Keywords: parallel I/O, disk cache, disk striping,
disk array, pario-bib
Abstract: It has been demonstrated that simulated
annealing provides high-quality results for the data clustering problem.
However, existing simulated annealing schemes are memory-based algorithms;
they are not suited for solving large problems such as data clustering which
typically are too big to fit in the memory space in its entirety. Various
buffer replacement policies, assuming either temporal or spatial locality,
are not useful in this case since simulated annealing is based on a
randomized search process. Poor locality of references will cause the memory
to thrash because too many replacements are required. This phenomenon will
incur excessive disk accesses and force the machine to run at the speed of
the I/O subsystem. In this paper, we formulate the data clustering problem as
a graph partition problem (GPP), and propose a decomposition-based approach
to address the issue of excessive disk accesses during annealing. We apply
the statistical sampling technique to randomly select subgraphs of the GPP
into memory for annealing. Both the analytical and experimental studies
indicate that the decomposition-based approach can dramatically reduce the
costly disk I/O activities while obtaining excellent optimized results.
Keywords: out of core, information retrieval,
parallel I/O, pario-bib
Keywords: parallel file system, parallel I/O,
pario-bib
Comment: Part of jin:io-book, revised version of
huber:ppfs.
Abstract: The I/O problem is described in the
context of parallel scientific applications. A user-level input/output
library, PPFS, is introduced to address these issues. The design and
implementation of PPFS are presented. Some simple performance benchmarks are
reported. Experiments on two production-scale applications are given.
Keywords: parallel file system, pario-bib
Comment: He describes the design and
implementation of PPFS, along with some experimental results. PPFS is a C++
library and a set of servers that implement a parallel file system on top of
unix on a cluster or a Paragon. Interesting features of PPFS include: files
are a sequence of records (fixed size or variable size), read_next and
read_any operations, a no-extend option to reduce overhead of maintaining
file-size information, client and server caching, intermediate caching agents
for consistency, prefetching and write behind, and user-defined declustering
and indexing policies. User-defined changes actually have to be precompiled
into the server programs. Good results in comparison to PFS on the Paragon,
though that doesn't say much. They are porting it to the SP-2.
Abstract: Rapid increases in processor performance
over the past decade have outstripped performance improvements in
input/output devices, increasing the importance of input/output performance
to overall system performance. Further, experience has shown that the
performance of parallel input/output systems is particularly sensitive to
data placement and data management policies, making good choices critical. To
explore this vast design space, we have developed a user-level library, the
Portable Parallel File System (PPFS), which supports rapid experimentation
and exploration. The PPFS includes a rich application interface, allowing the
application to advertise access patterns, control caching and prefetching,
and even control data placement. PPFS is both extensible and portable, making
possible a wide range of experiments on a broad variety of platforms and
configurations. Our initial experiments, based on simple benchmarks and two
application programs, show that tailoring policies to input/output access
patterns yields significant performance benefits, often improving performance
by nearly an order of magnitude.
Keywords: parallel file system, parallel I/O,
pario-bib
Keywords: parallel file system, parallel I/O,
pario-bib
Comment: See also elford:ppfs-tr, huber:ppfs.
Abstract: Rapid increases in processor performance
over the past decade have outstripped performance improvements in
input/output devices, increasing the importance of input/output performance
to overall system performance. Further, experience has shown that the
performance of parallel input/output systems is particularly sensitive to
data placement and data management policies, making good choices
critical. To explore this vast design space, we have developed a user-level
library, the Portable Parallel File System (PPFS), which supports rapid
experimentation and exploration. The PPFS includes a rich application
interface, allowing the application to advertise access patterns, control
caching and prefetching, and even control data placement. PPFS is both
extensible and portable, making possible a wide range of experiments on a
broad variety of platforms and configurations. Our initial experiments, based
on simple benchmarks and two application programs, show that tailoring
policies to input/output access patterns yields significant performance
benefits, often improving performance by nearly an order of magnitude.
Keywords: parallel file system, pario-bib
Comment: They have built a user-level library that
implements a parallel file system on top of a set of vanilla Unix file
systems. Their goals include flexibility and portability, so they can use
PPFS to explore issues in parallel I/O. They allow the application to have
lots of control over data distribution, cache and prefetch policies, etc.
They support fixed- and variable-length records. They support client, server,
and shared caches. This TR includes syntax and specs for all functions. They
include performance for synthetic benchmarks and application codes, compared
with Intel Paragon PFS (which is admittedly not a very tough competitor).
Abstract: Increasing requirements in HPC have led to improvements in CPU
power, but the bandwidth of I/O subsystems no longer keeps up with the
performance of processors. This problem is commonly known as the I/O
bottleneck. Additionally, new and stimulating data-intensive problems in
biology, physics, astronomy, space exploration, and human genome research are
arising, bringing new high-performance applications that deal with massive
data spread over globally distributed storage resources. HPC research has
therefore shifted more attention to I/O systems: all leading hardware vendors
of multiprocessor systems provide powerful concurrent I/O subsystems, and
researchers accordingly focus on the design of appropriate programming tools
and models that take advantage of the available hardware resources. Numerous
projects on this topic have appeared, producing a large and hard-to-survey
body of publications, most of which address very specific problems. Because
of when they appeared, the few existing overview papers deal only with
Parallel I/O or Cluster I/O.
Substantial progress has been made in these research areas since then. Grid
Computing has emerged as an important new field, distinguished from
conventional Distributed Computing by its focus on large-scale resource
sharing, innovative applications and, in some cases, high-performance
orientation. Over the past five years, research and development efforts
within the Grid community have produced protocols, services and tools that
address precisely the challenges that arise when we try to build Grids, I/O
being an important part of it. Therefore our work gives an overview of I/O in
HPC.
Keywords: parallel i/o, cluster i/o, grid i/o,
distributed computing, pario-bib
Comment: Like stockinger:dictionary, this master's
thesis categorizes and describes a large set of parallel I/O-related projects
and applications.
Keywords: parallel I/O, I/O, pario-bib
Comment: Does FORTRAN format conversion in
software in parallel or in hardware, to obtain good speedups for lots of
programs. However he found that increasing the I/O bandwidth was the most
significant change that could be made in the parallel program.
Abstract: Recently, there have been many efforts
to get high performance in cluster computing with inexpensive PCs connected
through high-speed networks. Some of these efforts aim to provide high
bandwidth and parallelism in file service using a distributed file system.
Other research on distributed file systems includes the cooperative cache,
which reduces server load and improves overall performance. A cooperative
cache shares file caches among clients so that a client can request a file
from another client, rather than from the server, through inter-client
message passing. Among distributed file systems, PVFS (Parallel Virtual File
System), widely used in Linux cluster computing, provides high performance
through parallel I/O. However, PVFS does not provide any file-cache facility.
This paper describes the design and implementation of a cooperative cache for
PVFS (Coopc-PVFS). We show the efficiency of Coopc-PVFS in comparison to the
original PVFS: the response time of Coopc-PVFS is shorter than, or similar
to, that of the original PVFS.
Keywords: PVFS, cooperative cache, pario-bib
Abstract: A new RAID-x (redundant array of
inexpensive disks at level x) architecture is presented for distributed I/O
processing on a serverless cluster of computers. The RAID-x architecture is
based on a new concept of orthogonal striping and mirroring (OSM) across all
distributed disks in the cluster. The primary advantages of this OSM approach
lie in: (1) a significant improvement in parallel I/O bandwidth, (2) hiding
disk mirroring overhead in the background, and (3) greatly enhanced
scalability and reliability in cluster computing applications. All claimed
advantages are substantiated with benchmark performance results on the
Trojans cluster built at USC in 1999. Throughout the paper, we discuss the
issues of scalable I/O performance, enhanced system reliability, and striped
checkpointing on distributed RAID-x in a serverless cluster environment.
Keywords: parallel I/O, disk array, disk striping,
RAID, pario-bib
Keywords: disk array, multimedia, parallel I/O,
pario-bib
Comment: Petri-net model of disk array using
Information-Dispersal Algorithm (IDA) to stripe data. Continuous-media
workload.
Keywords: multiprocessor architecture, parallel
I/O, pario-bib
Comment: See also information about Vesta file
system, corbett:vesta.
Keywords: file access pattern, parallel I/O, Intel
iPSC/2, hypercube, pario-bib
Comment: Lists several examples and the amount and
types of data they require, and how much bandwidth. Fluid flow modeling,
Molecular modeling, Seismic processing, and Tactical and strategic systems.
Keywords: parallel I/O, hypercube, Intel iPSC/2,
pario-bib
Comment: Simple overview, not much detail. See
intel:ipsc2, pierce:pario, asbury:fortranio. Separate I/O nodes from compute
nodes. Each I/O node has a SCSI bus to the disks, and communicates with other
nodes in the system via Direct-Connect hypercube routing.
Keywords: parallel architecture, parallel I/O,
Intel, pario-bib
Comment: Not a bad glossy. See also esser:paragon.
Keywords: parallel I/O, hypercube, Intel iPSC/2,
pario-bib
Keywords: parallel I/O, parallel I/O architecture,
parallel I/O algorithm, multiprocessor file system, workload
characterization, parallel file access pattern, pario-bib
Comment: A book containing papers from IOPADS '94
and IOPADS '95, plus several survey/tutorial papers. See the bib entries with
cross-ref to iopads-book.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: The entire proceedings is about parallel
I/O.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: The entire proceedings is about parallel
I/O.
Abstract: We provide an overview of different file
system architectures. We show how results from studies of I/O access patterns
have influenced file system design. We present techniques, algorithms, and
data structures used in file system implementations. We review issues related to
both local and distributed file systems. We describe distributed file system
architectures for different kinds of network connectivity: tightly-connected
networks (clusters and supercomputers), loosely-connected networks
(computational grids) or disconnected computers (mobile computing). File
system architectures for both network-attached and computer-attached storage
are reviewed. We show how parallel file systems address the requirements of
I/O-bound parallel applications. Different file-sharing semantics in
distributed and parallel file systems are explored. We also present how
efficient metadata management can be realized in journaled file systems.
Keywords: survey, file system architecture,
pario-bib
Abstract: This paper presents Clusterfile, a
parallel file system that provides parallel file access on a cluster of
computers. We introduce a file partitioning model that has been used in the
design of Clusterfile. The model uses a data representation that is optimized
for multidimensional array partitioning while allowing arbitrary partitions.
The paper shows how the file model can be employed for file partitioning into
both physical subfiles and logical views. We also present how the conversion
between two partitions of the same file is implemented using a general memory
redistribution algorithm. We show how we use the algorithm to optimize
non-contiguous read and write operations. The experimental results include
performance comparisons with the Parallel Virtual File System (PVFS) and an
MPI-IO implementation for PVFS.
Keywords: parallel file system, parallel I/O,
pario-bib
Abstract: This paper presents the integration of
two collective I/O techniques into the Clusterfile parallel file system:
disk-directed I/O and two-phase I/O. We show that global cooperative cache
management improves the collective I/O performance. The solution focuses on
integrating disk parallelism with other types of parallelism: memory (by
buffering and caching on several nodes), network (by parallel I/O scheduling
strategies) and processors (by redistributing the I/O related computation
over several nodes). The performance results show considerable throughput
increases over ROMIO's extended two-phase I/O.
Keywords: disk-directed I/O, two-phase I/O,
clusterfile parallel file system, cooperative cache, pario-bib
Abstract: This paper presents view I/O, a
non-contiguous parallel I/O technique. We show that the linear file model may
be an unsuitable abstraction for non-contiguous I/O optimizations.
Additionally, the poor cooperation between a file system and an I/O library
like MPI-IO may drastically affect the performance. View I/O has detailed
knowledge about parallel structure of a file and about the potential access
pattern and exploits it in order to improve performance. The access overhead
is reduced by using a "declare once, use several times" strategy and by file
offset compaction. We compare and contrast view I/O with other
non-contiguous I/O methods. Our measurements on a cluster of computers
indicate a significant performance improvement over other approaches.
Keywords: non-contiguous I/O, parallel file
structure, pario-bib
Keywords: parallel file system, pario-bib
Comment: File system in the PIMOS operating system
for the PIM (Parallel Inference Machine) in the Fifth Generation Computer
Systems project in Japan. Paper design, no results yet. Uses disks that are
attached directly to the computational processors. Significant in that it
does use client caches in a parallel file system. Caches are kept coherent
with a centralized directory-based protocol for exclusive-writer,
multiple-reader semantics, supporting sequential consistency. Disk management
includes logging to survive crashes. Bitmap free list with buddy system to
support full, 1/2, and 1/4 blocks. Trick to avoid constant update of on-disk
free list. My suspicion is that cache coherence protocol may be expensive,
especially in larger systems.
Abstract: One of the key components of a multi-user multimedia-on-demand
system is the data server. Digitalization of traditionally analog data such
as video and audio, and the feasibility of obtaining network bandwidths above
the gigabit-per-second range are two important advances that have made
possible the realization, in the near future, of interactive distributed
multimedia systems. Secondary-to-main memory I/O technology has not kept pace
with advances in networking, main memory and CPU processing power.
Consequently, the performance of the server has a direct bearing on the
overall performance of such a system. In this paper we present a
high-performance solution to the I/O retrieval problem in a distributed
multimedia system. Parallelism of data retrieval is achieved by striping the
data across multiple disks. We identify the different components that
contribute to media data retrieval delay. The variable delays among these
have a great bearing on the server throughput under varying load conditions.
We present a buffering scheme to minimize these variations. We have
implemented our model on the Intel Paragon parallel computer. The results of
component-wise instrumentation of the server operation are presented and
analyzed. We present experimental results that demonstrate the efficacy of
the buffering scheme. Based on our experiments, a dynamic admission control
policy that takes server workload into account is proposed.
Keywords: parallel I/O, I/O scheduling,
multimedia, video on demand, pario-bib
Comment: Much more detailed than
jadav:media-on-demand. Here they present less survey information, and all the
details on their Paragon implementation/simulation. They experiment with many
tradeoffs, and propose and evaluate several scheduling and admission-control
algorithms.
Abstract: One of the key components of a multi-user multimedia-on-demand
system is the data server. Digitization of
traditionally analog data such as video and audio, and the feasibility of
obtaining network bandwidths above the gigabit per second range are two
important advances that have made possible the realization, in the near
future, of interactive distributed multimedia systems. Secondary-to-main
memory I/O technology has not kept pace with advances in networking, main
memory and CPU processing power. Consequently, the performance of the server
has a direct bearing on the overall performance of such a system. We develop
a model for the architecture of a server for such a system. Parallelism of
data retrieval is achieved by striping the data across multiple disks. The
performance of any server ultimately depends on the data access patterns. Two
modifications of the basic retrieval algorithm are presented to exploit data
access patterns in order to improve system throughput and response time. A
complementary information caching optimization is discussed. Finally, we
present performance results of these algorithms on the IBM SP1 and Intel
Paragon parallel computers.
Keywords: parallel I/O, multimedia, pario-bib
Comment: Journal version is jadav:j-ioschedule?
See also jadav:media-on-demand. [Comments based on a much earlier version.]
They propose I/O scheduling algorithms for multimedia file servers. They
assume an MIMD architecture with no shared memory and with a disk on every
node. One node is essentially a manager for new requests. Another set are
interface nodes, each managing the data flow for a few multimedia data
streams. The majority are server nodes, responsible just for fetching their
data from disk and sending it to the interface nodes. The interface nodes
assemble data from the server nodes into a data stream, and send it on out to
the client. They describe algorithms for scheduling requests from the
interface node to the server node, and for sending data out to the client.
They also describe an algorithm for determining whether the system can accept
a new request.
Keywords: multimedia, scheduling, parallel I/O,
pario-bib
Comment: See also jadav:ioschedule,
jadav:j-ioschedule.
Keywords: parallel I/O, I/O scheduling,
multimedia, video on demand, pario-bib
Comment: Conference version is jadav:ioschedule;
similar abstract. See jadav:media-on-demand.
Abstract: A server for an interactive distributed
multimedia system may require thousands of gigabytes of storage space and
high I/O bandwidth. In order to maximize system utilization, and thus
minimize cost, the load must be balanced among the server's disks,
interconnection network and scheduler. Many algorithms for maximizing
retrieval capacity from the storage system have been proposed. This paper
presents techniques for improving server capacity by assigning media requests
to the nodes of a server so as to balance the load on the interconnection
network and the scheduling nodes. Five policies for dynamic request
assignment are developed. An important factor that affects data retrieval in
a high-performance continuous media server is the degree of parallelism of
data retrieval. The performance of the dynamic policies on an implementation
of a server model developed earlier is presented for two values of the degree
of parallelism.
Keywords: multimedia, parallel I/O, pario-bib
Abstract: This paper discusses the architectural
requirements of a multimedia-on-demand system, with special emphasis on the
media server. Although high-performance computers are the best choice for
building media-on-demand servers, implementation poses many difficulties. We
conclude with a discussion of the open issues regarding the design and
implementation of the server.
Keywords: parallel I/O, multimedia, video on
demand, pario-bib
Comment: A survey of the issues involved in
designing a media-on-demand server (they really focus on temporal data like
video and audio). They do have a few results comparing various granularities
for disk-requests and network messages, which seem to be from an Intel
Paragon implementation, although they do not describe the experimental setup.
See jadav:evaluation, jadav:j-ioschedule, jadav:ioschedule.
Keywords: wireless communication, mobile
computing, RAID, parallel I/O, pario-bib
Comment: They discuss the idea of broadcasting a
disk's data over the air, so PDAs can 'read' the disk by waiting for the
necessary data to come along. Good for read-only or write-rarely disks. They
discuss the idea of dividing the air into multiple (frequency or time) tracks
and 'striping' data across the tracks for better bandwidth and reliability.
Abstract: The I/O bottleneck in parallel computer systems has recently begun
receiving increasing interest. Most attention has focused on improving the
performance of I/O devices using fairly low-level parallelism in techniques
such as disk striping and interleaving. Widely applicable solutions, however,
will require an integrated approach which addresses the problem at multiple
system levels, including applications, systems software, and architecture. We
propose that within the context of such an integrated approach, scheduling
parallel I/O operations will become increasingly attractive and can
potentially provide substantial performance benefits. We describe a simple
I/O scheduling problem and present approximate algorithms for its solution.
The costs of using these algorithms in terms of execution time, and the
benefits in terms of reduced time to complete a batch of I/O operations, are
compared with the situations in which no scheduling is used, and in which an
optimal scheduling algorithm is used. The comparison is performed both
theoretically and experimentally. We have found that, in exchange for a small
execution time overhead, the approximate scheduling algorithms can provide
substantial improvements in I/O completion times.
Keywords: network, graph coloring, multiprocessor
file system, resource allocation, scheduling, parallel I/O, pario-bib
Comment: See also jain:pario
Keywords: parallel I/O, shared memory, scheduling,
pario-bib
Comment: An algorithm to schedule (off-line) a set
of transfers between P procs and D disks, such that no proc or disk does more
than one request at a time, and no more than K transfers are concurrent (due
to channel limits), with integer arbitrary-length transfers that are
preemptable (ie segmentable). Much faster than previous algorithms. Problems,
IMHO: off-line is only good for batch executions with known needs (ok for big
collective I/Os I suppose). All k channels are usable by all proc-disk pairs,
may not be realistic. No accommodation for big difference in disk and channel
time, ie, disk probably can't do a channel transfer every time unit. Allows
transfers in any order, which means disk seeks could be bad. No cost for
preemption of a transfer, which could mean more message overhead if more
messages are needed to do a given transfer. Assumes all transfers have
predictable time. Still, it could be useful in some situations, esp. where
order really doesn't matter.
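To make the flavor of the scheduling problem concrete, here is a minimal,
hypothetical sketch (ours, not the paper's algorithm, which also handles
preemption, arbitrary transfer lengths, and channel limits more carefully):
unit-length transfers between processors and disks are greedily packed into
rounds so that no processor or disk is used twice in a round and at most K
transfers proceed concurrently.

    # Illustrative sketch only -- not the paper's algorithm.  Pack unit-length
    # (processor, disk) transfers into rounds: within a round no processor or
    # disk handles more than one transfer, and at most K transfers overlap.
    def schedule_rounds(transfers, K):
        rounds, remaining = [], list(transfers)
        while remaining:
            busy_procs, busy_disks, this_round, leftover = set(), set(), [], []
            for p, d in remaining:
                if (len(this_round) < K and p not in busy_procs
                        and d not in busy_disks):
                    this_round.append((p, d))
                    busy_procs.add(p)
                    busy_disks.add(d)
                else:
                    leftover.append((p, d))
            rounds.append(this_round)
            remaining = leftover
        return rounds

    # Example: four transfers, three processors, two disks, channel limit 2.
    print(schedule_rounds([(0, 0), (1, 0), (2, 1), (0, 1)], K=2))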
Abstract: We sketch the reasons for the I/O
bottleneck in parallel and distributed systems, pointing out that it can be
viewed as a special case of a general bottleneck that arises at all levels of
the memory hierarchy. We argue that because of its severity, the I/O
bottleneck deserves systematic attention at all levels of system design. We
then present a survey of the issues raised by the I/O bottleneck in five key
areas of parallel and distributed systems: applications, algorithms,
compilers, operating systems and architecture. Finally, we address some of
the trends we observe emerging in new paradigms of parallel and distributed
computing: the convergence of networking and I/O, I/O for massively
distributed ``global information systems'' such as the World Wide Web, and
I/O for mobile computing and wireless communications. These considerations
suggest exciting new research directions in I/O for parallel and distributed
systems in the years to come.
Keywords: parallel I/O, out-of-core, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel I/O, file access pattern,
multiprocessor file system, pario-bib
Comment: He looks at the effect of I/O traffic on
memory access in a multistage network, and custom mappings of file data to
disks to support non-sequential I/O. He considers both the traditional
``multiuser'' workload and the case where an application accesses a single
file in parallel. Assumes a dance-hall shared-memory MIMD base architecture
(CEDAR). Disks are attached either to the memory or processor side of the
network, and in either case require four network traversals per read/write
operation. Nice summary of previous parallel I/O architectures, and
characterization of the workload. Main conclusions: the network is not an
inherent bottleneck, but I/O traffic can cause up to 50% loss in memory
traffic bandwidth, and bursts of I/O can saturate the network. For a high I/O
request rate (eg, all procs active), spread each request over a small number
of disks (eg, one), whereas for a low I/O request rate (eg, one proc active)
spread each request over lots of disks (eg, all). This avoids cache thrashing
when multiple procs hit on one disk node. However, if they are all reading
the same data, then there is no cache thrashing and you want to maximize
parallelism across disks. When accessing disjoint parts of the same file, it
is sometimes better to have one proc do all the accesses, because this avoids
out-of-order requests that spoil prefetching, and it avoids contention by
multiple procs. No single file-to-disk mapping worked for everything;
interleaved (striped) worked well for most sequential patterns, but
``sequential'' (partitioned) mappings worked better for multiple-process
loads that tend to focus each process on a disk, eg, an interleaved pattern
where the stride is equal to the number of disks. Thus, if your pattern can
get you disk locality, use a mapping that will provide it.
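As a purely illustrative rendering of the two file-to-disk mappings the
comment contrasts, the functions below map a logical file block to a disk
either round-robin (interleaved/striped) or by contiguous regions (the
``sequential''/partitioned mapping); the names and parameters are ours, not
the paper's.

    # Hypothetical sketch of the two mappings discussed above.
    def interleaved_disk(block, num_disks):
        # striped: consecutive blocks fall on consecutive disks
        return block % num_disks

    def partitioned_disk(block, num_disks, blocks_per_disk):
        # "sequential": each disk holds one contiguous region of the file
        return min(block // blocks_per_disk, num_disks - 1)

    # With a stride equal to num_disks, an interleaved mapping sends every
    # access from one process to the same disk; whether that helps depends on
    # how many processes end up sharing that disk.
    print([interleaved_disk(b, 8) for b in range(0, 32, 8)])
    print([partitioned_disk(b, 8, 4) for b in range(0, 32, 8)])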
Keywords: parallel I/O, pario-bib
Comment: Ways to distribute data across multiple
disks to speed information retrieval, given an inverted index. Based on a
shared-everything multiprocessor model.
Keywords: RAID, disk array, parallel file system,
caching, prefetching, multiprocessor file system, parallel I/O applications,
parallel I/O, pario-bib
Comment: An excellent collection of papers that
were mostly published earlier.
Abstract: To access a RAID (redundant arrays of
inexpensive disks), the disk stripe size greatly affects the performance of
the disk array. In this article, we present a performance model to analyze
the effects of striping with different stripe sizes in a RAID. The model can
be applied to optimize the stripe size. Compared with previous approaches,
our model is simpler to apply and more accurately reveals the real
performance. Both system designers and users can apply the model to support
parallel I/O events.
Keywords: parallel I/O, RAID, disk striping,
pario-bib
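As a back-of-the-envelope illustration of why the stripe unit matters (our
own toy model, not the analytical model in the paper), the sketch below
counts how many disks a request touches and estimates a service time from a
fixed positioning overhead plus the per-disk transfer time.

    # Toy model only; the paper's analytical model is more detailed.
    def disks_touched(offset, request_bytes, stripe_unit, num_disks):
        first = offset // stripe_unit
        last = (offset + request_bytes - 1) // stripe_unit
        return min(last - first + 1, num_disks)

    def service_time_ms(request_bytes, stripe_unit, num_disks,
                        position_ms=8.0, transfer_mb_per_s=50.0):
        d = disks_touched(0, request_bytes, stripe_unit, num_disks)
        per_disk_bytes = request_bytes / d
        return position_ms + per_disk_bytes / (transfer_mb_per_s * 1e6) * 1e3

    # Larger stripe units use fewer disks per request: less positioning work
    # in aggregate, but less transfer parallelism for a single large request.
    for unit in (4096, 65536, 1 << 20):
        print(unit, round(service_time_ms(4 << 20, unit, 8), 2))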
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: If you insert blocks into a striped file,
you mess up the nice striping. So he breaks the file into striped extents,
and keeps track of the extents with a distributed B-tree index. Deletions
also fit into the same scheme.
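A minimal stand-in for the structure the comment describes (ours; the paper
uses a distributed B-tree rather than a flat in-memory list): a sorted index
of independently striped extents lets a logical block number be resolved to a
disk even after insertions have broken the uniform striping.

    import bisect

    # Hypothetical, simplified stand-in for the extent index described above.
    class ExtentIndex:
        def __init__(self):
            self.starts = []    # logical block where each extent begins
            self.extents = []   # (length_in_blocks, first_disk, num_disks)

        def append_extent(self, length, first_disk, num_disks):
            start = (self.starts[-1] + self.extents[-1][0]) if self.starts else 0
            self.starts.append(start)
            self.extents.append((length, first_disk, num_disks))

        def locate(self, logical_block):
            i = bisect.bisect_right(self.starts, logical_block) - 1
            length, first_disk, num_disks = self.extents[i]
            within = logical_block - self.starts[i]
            return first_disk + within % num_disks   # disk holding the block

    idx = ExtentIndex()
    idx.append_extent(length=100, first_disk=0, num_disks=4)
    idx.append_extent(length=50, first_disk=4, num_disks=2)
    print(idx.locate(120))   # block 120 lives in the second extent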
Abstract: The evolution of system architectures
and system configurations has created the need for a new supercomputer system
interconnect. Attributes required of the new interconnect include commonality
among system and subsystem types, scalability, low latency, high bandwidth, a
high level of resiliency, and flexibility. Cray Research Inc. is developing a
new system channel to meet these interconnect requirements in future systems.
The channel has a ring-based architecture, but can also function as a
point-to-point link. It integrates control and data on a single, physical
path while providing low latency and variance for control messages. Extensive
features for client isolation, diagnostic capabilities, and fault tolerance
have been incorporated into the design. The attributes and features of this
channel are discussed along with implementation and protocol specifics.
Keywords: mass storage, I/O architecture, I/O
interconnect, supercomputer, parallel I/O, pario-bib
Comment: About the Cray Research SCX channel,
capable of 1200 MB/s peak and 900 MB/s delivered throughput.
Keywords: computational physics, parallel I/O,
pario-bib
Comment: Old paper on the need for large memory
and fast paging and I/O in out-of-core solutions to 3-d seismic modeling.
They used 4-way parallel I/O to support their job. Needed to transfer a 3-d
matrix in and out of memory by rows, columns, and vertical columns. Stored in
block-structured form to improve locality on the disk.
Keywords: mass storage, parallel I/O,
multiprocessor file system interface, pario-bib
Keywords: parallel computing, scientific
computing, weather prediction, global climate model, parallel I/O, pario-bib
Comment: They talk about a weather code. There's a
bit about the parallel I/O issues. They periodically write a restart file,
and they write out several types of data files. They write out the data in
any order, with a little mini-header in each chunk that describes the chunk.
I/O was not a significant percentage of their run time on either the CM5 or
C90. See hammond:atmosphere and hack:ncar in the same issue.
Abstract: Buffer management for a D-disk parallel
I/O system is considered in the context of randomized placement of data on
the disks. A simple prefetching and caching algorithm PHASE-LRU using bounded
lookahead is described and analyzed. It is shown that PHASE-LRU performs an
expected number of I/Os that is within a factor Theta(log D/log log D) of the
number performed by an optimal off-line algorithm. In contrast, any
deterministic buffer management algorithm with the same amount of lookahead
must do at least $\Omega(\sqrt{D})$ times the number of I/Os of the optimal.
Keywords: parallel I/O, prefetching, data
placement, caching, buffer management, analysis, algorithms, randomization,
pario-bib
Abstract: We address the problem of prefetching
and caching in a parallel I/O system and present a new algorithm, PC-OPT, for
parallel disk scheduling. Traditional buffer management algorithms that minimize the
number of block misses are substantially suboptimal in a parallel I/O system
where multiple I/Os can proceed simultaneously. We show that in the offline
case, where a priori knowledge of all the requests is available, PC-OPT
performs the minimum number of I/Os to service the given I/O requests. This
is the first parallel I/O scheduling algorithm that is provably offline
optimal in the parallel disk model. In the online case, we study the context
of global L-block lookahead, which gives the buffer management algorithm a
lookahead consisting of L distinct requests. We show that the competitive
ratio of PC-OPT, with global L-block lookahead, is Θ(M-L+D) when L < M, and
Θ(MD/L) when L > M, where the number of disks is D and the buffer size is M.
Keywords: parallel I/O, file prefetching,
pario-bib
Abstract: We address the problem of prefetching and caching in a parallel
I/O system and present a new algorithm for optimal parallel-disk scheduling.
Traditional buffer management algorithms that minimize the number of I/O disk
accesses are substantially suboptimal in a parallel I/O system where multiple
I/Os can proceed simultaneously. We present a new algorithm, Super, for
parallel-disk I/O scheduling. We show that in the off-line case, where a
priori knowledge of all the requests is available, Super performs the minimum
number of I/Os to service the given I/O requests. This is the first parallel
I/O scheduling algorithm that is provably offline optimal. In the on-line
case, we study Super in the context of global L-block lookahead, which gives
the buffer management algorithm a lookahead consisting of L distinct
requests. We show that the competitive ratio of Super, with global L-block
lookahead, is Theta(M-L+D) when L < M, and Theta(MD/L) when L >= M, where the
number of disks is D and the buffer size is M.
Keywords: parallel I/O, prefetch, disk cache,
pario-bib
Abstract: We present an optimal algorithm, L-OPT,
for prefetching and I/O scheduling in parallel I/O systems using a read-once
model of block reference. The algorithm uses knowledge of the next L block
references, L-block lookahead, to schedule I/Os in an on-line manner. It uses
a dynamic priority assignment scheme to decide when blocks should be
prefetched, so as to minimize the total number of I/Os. The parallel disk
model of an I/O system is used to study the performance of L-OPT. We show
that L-OPT is comparable to the best on-line algorithm with the same amount
of lookahead; the ratio of the length of its schedule to the length of the
optimal schedule is within a constant factor of the best possible.
Specifically, we show that the competitive ratio of L-OPT is
$Ω(\sqrt{MD/L})$ which matches the lower bound on the competitive ratio
of any prefetching algorithm with L-block lookahead. In addition we show that
when the lookahead consists of the entire reference string, L-OPT performs
the minimum possible number of I/Os; hence L-OPT is the optimal off-line
algorithm. Finally, using synthetic traces we empirically study the
performance characteristics of L-OPT.
Keywords: disk scheduling, parallel I/O, pario-bib
Keywords: parallel I/O, multimedia, multiprocessor
file system, pario-bib
Comment: Hook a video-display system to the
compute node of an SP-1 running Vesta, and then use Vesta file system to
serve the video.
Abstract: Video on Demand (VoD) servers are
expected to serve hundreds of customers with as many, or more, movie videos.
Such an environment requires large storage capacity and real-time,
high-bandwidth transmission capabilities. Massive striping of videos across
disk arrays is a viable means to store large amounts of video data and,
through parallelism of file access, achieve the needed bandwidth. The Vesta
Parallel File System facilitates parallel access from an application to files
distributed across a set of I/O processors, each with a set of attached
disks. Given Vesta's parallel file access capabilities, this paper examines a
number of issues pertaining to the implementation of VoD services on top of
Vesta. We develop a prototype VoD experimentation environment on an IBM SP-1
and analyze Vesta's performance in video data retrieval for real-time
playback. Specifically, we explore the impact of concurrent video streams
competing for I/O node resources, cache effects, and video striping across
multiple I/O nodes.
Keywords: parallel I/O, parallel file system,
video on demand, multimedia, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel file system, parallel I/O,
pario-bib
Abstract: Many large-scale applications have
significant I/O requirements as well as computational and memory
requirements. Unfortunately, the limited number of I/O nodes provided in a
typical configuration of the modern message-passing distributed-memory
architectures such as the Intel Paragon and the IBM SP-2 limits the I/O
performance of these applications severely. In this paper, we examine some
software optimization techniques and evaluate their effects in five different
I/O-intensive codes from both small and large application domains. Our goals
in this study are twofold. First, we want to understand the behavior of
large-scale data-intensive applications and the impact of I/O subsystems on
their performance and vice versa. Second, and more importantly, we strive to
determine the solutions for improving the applications' performance by a mix
of software techniques. Our results reveal that different applications can
benefit from different optimizations. For example, we found that some
applications benefit from file layout optimizations, whereas others take
advantage of collective I/O. A combination of architectural and software
solutions is normally needed to obtain good I/O performance. For example, we
show that with a limited number of I/O resources, it is possible to obtain
good performance by using appropriate software optimizations. We also show
that beyond a certain level, imbalance in the architecture results in
performance degradation even when using optimized software, thereby
indicating the necessity of an increase in I/O resources.
Keywords: parallel I/O, parallel application,
pario-bib
Abstract: Parallel machines are an important part
of the scientific application developer's tool box and the computational and
processing demands placed on these machines are rapidly increasing. Many
scientific applications tend to perform high volume data storage, data
retrieval and data processing, which demands high performance from the I/O
subsystem. In this paper, we conduct an experimental study of the I/O
performed by the Hartree-Fock (HF) method, as implemented using a fully
distributed data approach in the NWChem parallel computational chemistry
package. We use PASSION, a parallel and scalable I/O library, and its
optimizations, such as prefetching, to improve the I/O performance of the HF
application, and we present extensive experimental results.
The effects of both application-related factors and system-related factors on
the application's I/O performance are studied in detail. We rank the
optimizations based on the significance and impact on the performance of HF's
I/O phase as: I. efficient interface to the file system, II. prefetching
optimization, and III. buffering. The results show that within the limits of
our experimental parameters, application-related factors are more effective
on the overall I/O behavior of this application. We obtained up to 95%
improvement in I/O time and 43% improvement in the overall application
performance with these optimizations.
Keywords: parallel I/O, scientific computing,
pario-bib
Comment: No page numbers: proceedings on CDROM and
web only.
Abstract: Many scientific applications tend to
perform high volume data storage, data retrieval and data processing, which
demands high performance from the I/O subsystem. The focus and contribution
of this work is to study the I/O behavior of the Hartree-Fock method using
PASSION. HF's I/O phases can contribute up to 62.34% of the total execution
time. We reduce the execution time and I/O time up to 54% and 6%
respectively of that of the original case through PASSION and its
optimizations. Additionally, we categorize the factors that affect the I/O
performance of HF into key application-related parameters and key
system-related parameters. Based on extensive empirical results and within
our experimental space, we order the parameters according to their impact on
HF's I/O performance as follows: efficient interface, prefetching, buffering,
number of I/O nodes, striping factor and striping unit. We conclude that
application-related factors have a more significant effect on HF's I/O
performance than the system-related factors within our experimental space.
Keywords: parallel I/O application, pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Abstract: Current approaches to parallel I/O
demand extensive user effort to obtain acceptable performance. This is in
part due to difficulties in understanding the characteristics of a wide
variety of I/O devices and in part due to inherent complexity of I/O
software. While parallel I/O systems provide users with environments where
persistent data sets can be shared between parallel processors, the ultimate
performance of I/O-intensive codes depends largely on the relation between
data access patterns exhibited by parallel processors and storage patterns of
data in files and on disks. In cases where access patterns and storage
patterns match, we can exploit parallel I/O hardware by allowing each
processor to perform independent parallel I/O. In order to keep performance
decent under circumstances in which data access patterns and storage patterns
do not match, several I/O optimization techniques have been developed in
recent years. Collective I/O is such an optimization technique that enables
each processor to do I/O on behalf of other processors if doing so improves
the overall performance. While it is generally accepted that collective I/O
and its variants can bring impressive improvements as far as the I/O
performance is concerned, it is difficult for the programmer to use
collective I/O in an optimal manner. In this paper, we propose and evaluate a
compiler-directed collective I/O approach which detects the opportunities for
collective I/O and inserts the necessary I/O calls in the code automatically.
An important characteristic of the approach is that instead of applying
collective I/O indiscriminately, it uses collective I/O selectively only in
cases where independent parallel I/O would not be possible or would lead to
an excessive number of I/O calls. The approach involves compiler-directed
access pattern and storage pattern detection schemes that work on a multiple
application environment. We have implemented the necessary algorithms in a
source-to-source translator and within a stand-alone tool. Our experimental
results on an SGI/Cray Origin 2000 multiprocessor machine demonstrate that
our compiler-directed collective I/O scheme performs very well on different
setups built using nine applications from several scientific benchmarks. We
have also observed that the I/O performance of our approach is only 5.23
percent worse than an optimal scheme.
Keywords: parallel I/O, collective I/O, compiler,
pario-bib
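The selection rule in the abstract can be paraphrased as a small decision
procedure; the sketch below is our reading of it (the names and the
call-count threshold are hypothetical), not the paper's compiler pass.

    # Hedged paraphrase of the selection rule, not the paper's implementation.
    def choose_io_strategy(access_pattern, storage_pattern,
                           est_independent_calls, max_calls=1024):
        if access_pattern == storage_pattern:
            return "independent"   # each process reads its own data contiguously
        if est_independent_calls <= max_calls:
            return "independent"   # mismatch, but the request count is tolerable
        return "collective"        # reorganize data among processes instead

    # A row-wise access to a column-major file would generate many small
    # requests per process, so collective I/O is selected.
    print(choose_io_strategy("block-row", "block-column",
                             est_independent_calls=8192))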
Abstract: Since many large-scale computational
problems deal with large quantities of data, optimizing the
performance of I/O subsystems of massively parallel machines is an important
challenge for system designers. We describe data access reorganization
strategies for efficient compilation of out-of-core data-parallel programs on
distributed memory machines. Our analytical approach and experimental results
indicate that the optimizations introduced in this paper can reduce the
amount of time spent in I/O by as much as an order of magnitude on both
uniprocessors and multicomputers.
Keywords: parallel I/O, compiler, out-of-core,
pario-bib
Abstract: This paper presents compiler algorithms
to optimize out-of-core programs. These algorithms consider loop and data
layout transformations in a unified framework. The performance of an
out-of-core loop nest containing many references can be improved by a
combination of restructuring the loops and file layouts. This approach
considers array references one-by-one and attempts to optimize each reference
for parallelism and locality. When there are references for which parallelism
optimizations do not work, communication is vectorized so that data transfer
can be performed before the innermost tiling loop. Preliminary results from
hand-compiles on IBM SP-2 and Intel Paragon show that this approach reduces
the execution time, improves the bandwidth speedup and overall speedup. In
addition, we extend the base algorithm to work with file layout constraints
and show how it can be used for optimizing programs consisting of multiple
loop nests.
Keywords: compiler, out of core, parallel I/O,
pario-bib
Keywords: parallel I/O, compiler, out of core,
pario-bib
Keywords: parallel I/O, compiler, pario-bib
Abstract: Programs accessing disk-resident arrays
perform poorly in general due to an excessive number of I/O calls and
insufficient help from compilers. In this paper, in order to alleviate this
problem, we propose a series of compiler optimizations. Both the analytical
approach we use and the experimental results provide strong evidence that our
method is very effective on uniprocessors for out-of-core nests whose data
sizes far exceed the size of available memory.
Keywords: parallel I/O, compiler, out-of-core,
pario-bib
Keywords: out of core, parallel I/O, pario-bib
Abstract: This paper describes optimization
techniques for translating out-of-core programs written in a data parallel
language to message passing node programs with explicit parallel I/O. We
demonstrate that straightforward extension of in-core compilation techniques
does not work well for out-of-core programs. We then describe how the
compiler can optimize the code by (1) determining appropriate file layouts
for out-of-core arrays, (2) permuting the loops in the nest(s) to allow
efficient file access, and (3) partitioning the available node memory among
references based on I/O cost estimation. Our experimental results indicate
that these optimizations can reduce the amount of time spent in I/O by as
much as an order of magnitude.
Keywords: compiler, data-parallel, out-of-core,
parallel I/O, pario-bib
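To illustrate the shape of the node program such a compiler emits (a
hand-written sketch under our own assumptions, not generated code), the loop
below is tiled so that one slab of a disk-resident array at a time is read,
updated, and written back with explicit I/O calls.

    import array

    SLAB_ELEMS = 1 << 20    # hypothetical slab size chosen to fit node memory
    ELEM_SIZE = 8           # bytes per double

    # Hand-written sketch of an out-of-core tiled loop with explicit I/O.
    def scale_out_of_core(path, total_elems, factor):
        with open(path, "r+b") as f:
            for start in range(0, total_elems, SLAB_ELEMS):
                count = min(SLAB_ELEMS, total_elems - start)
                f.seek(start * ELEM_SIZE)
                slab = array.array("d")
                slab.frombytes(f.read(count * ELEM_SIZE))
                for i in range(len(slab)):     # the computation on this tile
                    slab[i] *= factor
                f.seek(start * ELEM_SIZE)
                f.write(slab.tobytes())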
Abstract: This paper describes a framework by
which an out-of-core stencil program written in a data-parallel language can
be translated into node programs in a distributed-memory message-passing
machine with explicit I/O and communication. We focus on a technique called
\emph{Data Space Tiling} to group data elements into slabs that can fit into
memories of processors. Methods to choose \emph{legal} tile shapes under
several constraints and deadlock-free scheduling of tiles are investigated.
Our approach is \emph{unified} in the sense that it can be applied to both
FORALL loops and the loops that involve flow-dependences.
Keywords: parallel I/O, compiler, out-of-core,
pario-bib
Keywords: interprocess communication, parallel
I/O, pario-bib
Comment: A parallel version of the Unix 'pipe'
feature, for connecting the output of one program to multiple other programs
or files. Implemented on Solaris. Performance results.
Keywords: scientific database, parallel I/O,
pario-bib
Comment: See also karpovich:case-study. That is a
subset of this paper.
Keywords: scientific database, parallel I/O,
pario-bib
Comment: Apparently a subset of
karpovich:bottleneck. They store a sparse, multidimensional data set (radio
astronomy data) as a set of tagged data values, ie, as a set of tuples, each
with several keys and a data value. They use a PLOP format to partition each
dimension into slices, so that each intersection of the slices forms a
bucket. They decide on the splits based on a preliminary statistical survey
of the data. Bucket overflow is handled by chaining. Then, they evaluate
various kinds of queries, ie, multidimensional range queries, for their
performance. In this workload queries (reads) are much more common than
updates (writes).
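The bucketing the comment describes reduces, in its simplest form, to cutting
each dimension at chosen split points and naming a bucket by its tuple of
slice numbers; the sketch below shows only that simplest form (overflow
chaining and the distribution of buckets across disks are omitted).

    import bisect

    # Simplified sketch of per-dimension slicing into buckets.
    def bucket_of(record_key, splits_per_dim):
        # record_key: tuple of coordinates; splits_per_dim: sorted split points
        # for each dimension.  The bucket is the tuple of slice numbers.
        return tuple(bisect.bisect_right(splits, coord)
                     for coord, splits in zip(record_key, splits_per_dim))

    splits = [[10.0, 20.0, 30.0],   # dimension 0: four slices
              [0.5, 1.5]]           # dimension 1: three slices
    print(bucket_of((12.3, 1.7), splits))   # -> (1, 2)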
Keywords: parallel I/O, multiprocessor file system
interface, object oriented, pario-bib
Comment: See also grimshaw:elfs, grimshaw:ELFSTR,
grimshaw:objects, and karpovich:*. This is also available as UVA TR CS-94-28.
This paper focuses more on the object-oriented nature of ELFS than on its
ability to support parallel I/O. It also describes two classes they've
developed, one for 2d dense matrices and another for range queries on
multidimensional sparse data sets. It does have some new performance numbers
for ELFS on Intel CFS.
Keywords: parallel I/O, RAID, disk array, disk
striping, pario-bib
Comment: Part of jin:io-book; reformatted version
of katz:diskarch.
Keywords: parallel I/O, RAID, disk array, disk
striping, pario-bib
Comment: Good review of the background of disks
and I/O architectures, but a shorter RAID presentation than patterson:RAID.
Also addresses controller structure. Good ref for the I/O crisis background,
though they don't use that term here. Good taxonomy of previous array
techniques.
Keywords: parallel I/O, RAID, Sprite, reliability,
disk striping, disk array, pario-bib
Comment: Early RAID project paper. Describes the
Berkeley team's plan to use an array of small (100M) hard disks as an I/O
server for network file service, transaction processing, and supercomputer
I/O. Considering performance, reliability, and flexibility. Initially hooked
to their SPUR multiprocessor, using Sprite operating system, new filesystem.
Either asynchronous striped or independent operation. Supercomputer I/O is
characterized as sequential, minimum latency, low throughput. Use of parity
disks to boost reliability. Files may be striped across one or more disks and
extend over several sectors, thus a two-dimensional filesystem; striping need
not involve all disks.
Keywords: distributed file system, supercomputer
file system, file striping, RAID, parallel I/O, pario-bib
Comment: Comments on the emerging trend of file
systems for mainframes and supercomputers that are not attached directly to
the computer, but instead to a network attached to the computer. Avoiding
data copying seems to be a critical issue in the OS and controllers, for disk
and network interfaces. Describes RAID-II prototype.
Keywords: parallel I/O, RAID, reliability, disk
array, pario-bib
Comment: A short summary of the RAID project. Some
more up-to-date info, like that they have completed the first prototype with
8 SCSI strings and 32 disks.
Abstract: This paper reports on part of an on-going analysis of parallel
systems for commercial users. The particular focus of this paper is on the
requirements that commercial users, in particular users with financial
database systems, have of parallel systems. The issues of concern to such
users differ from those of concern to science and engineering users.
Performance of the parallel system is not the only, or indeed primary, reason
for moving to such systems for commercial users. Infrastructure issues are
important, such as system availability and inter-working with existing
systems. These issues are discussed in the context of a banking customer's
requirements. The various technical concerns that these requirements impose
are discussed in terms of commercially available systems.
Keywords: parallel architecture, parallel I/O,
databases, commercial requirements, pario-bib
Keywords: parallel I/O, out-of-core, pario-bib
Comment: They describe a project they are
beginning, which attempts to have the compiler analyze a program that uses
large arrays, and insert explicit I/O statements to move data from those
arrays to and from disk. This is seen as an alternative to OS and hardware
virtual memory, and is likely to provide much better performance (or so
their initial results show). Their focus is on overlapping I/O and computation.
Abstract: Parallel single-level store (PSLS)
systems integrate a shared virtual memory and a parallel file system. They
provide programmers with a global address space including both memory and
file data. PSLS systems implemented in a cluster thus represent a natural
support for long-running parallel applications, combining both the natural
shared memory programming model and a large and efficient file system.
However, the need to tolerate failures in such a system increases with the
size of applications. We present a highly-available parallel single level
store system (HA-PSLS), which smoothly integrates a backward error recovery
high-availability mechanism into a PSLS system. Our system is able to
tolerate multiple transient failures, a single permanent failure, and power
cut failures affecting the whole cluster, without requiring any specialized
hardware. For this purpose, HA-PSLS relies on a high degree of integration
(and reusability) of high-availability and standard features. A prototype
integrating our high-availability support has been implemented and we show
some performance results.
Keywords: parallel single level store,
high-availability, fault tolerance, checkpointing, replication, integration,
parallel file systems, shared virtual memory, pario-bib
Abstract: We describe the effects of a new
user-level library for the Galley Parallel File System. This library allows
some pre-existing sequential programs to make use of the Galley Parallel File
System with minimal modification. It permits programs to efficiently use the
parallel file system because the user-level library groups accesses together.
We examine the performance of our library, and we show how code needs to be
modified to use the library.
Keywords: multiprocessor file system interface,
run-time library, parallel file system, parallel I/O, pario-bib, dfk
Keywords: disk interleaving, parallel I/O,
performance modeling, pario-bib
Comment: As opposed to synchronous disk
interleaving, where disks are rotationally synchronous and one access is
processed at a time. They develop a performance model and validate it with
traces of a database system's disk accesses. Average access delay on each
disk can be approximated by a normal distribution.
Keywords: parallel I/O, disk striping, scientific
computing, algorithm, pario-bib
Keywords: parallel I/O, disk striping, file access
pattern, disk array, pario-bib
Comment: Uniprocessor interleaving techniques.
Good case for interleaving. Probably better to reference kim:interleaving and
kim:fft. Discusses a 3D FFT algorithm in which the matrix is broken into
subblocks that are accessed in layers. The layers are stored so that access
is either contiguous or at a regular stride, in fairly large chunks.
Keywords: parallel I/O, disk striping, disk array,
pario-bib
Keywords: disk prefetching, parallel I/O,
pario-bib
Comment: They do a theoretical analysis of
prefetching and caching in uniprocessor, single- and multi-disk situations,
given that they know the complete access sequence; their measure is not hit
rate but rather overall execution time. They found some algorithms that are
close to optimal.
Abstract: High-performance I/O systems depend on
prefetching and caching in order to deliver good performance to applications.
These two techniques have generally been considered in isolation, even though
there are significant interactions between them; a block prefetched too early
reduces the effectiveness of the cache, while a block cached too long reduces
the effectiveness of prefetching. In this paper we study the effects of
several combined prefetching and caching strategies for systems with multiple
disks. Using disk-accurate trace-driven simulation, we explore the
performance characteristics of each of the algorithms in cases in which
applications provide full advance knowledge of accesses using hints. Some of
the strategies have been published with theoretical performance bounds, and
some are components of systems that have been built. One is a new algorithm
that combines the desirable characteristics of the others. We find that when
performance is limited by I/O stalls, aggressive prefetching helps to
alleviate the problem; that more conservative prefetching is appropriate when
significant I/O stalls are not present; and that a single, simple strategy is
capable of doing both.
Keywords: parallel I/O, tracing, prefetch,
trace-driven simulation, pario-bib
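As a point of reference for the strategies compared in this study, the sketch
below implements only the caching half under full advance knowledge: evict
the resident block whose next use is farthest in the future. The surveyed
algorithms additionally decide how aggressively to prefetch upcoming blocks
onto idle disks, which this toy simulator does not model.

    # Toy demand-fetch simulator with farthest-future eviction; prefetching
    # onto idle disks, the other half of the surveyed strategies, is omitted.
    def count_fetches(references, buffer_size):
        resident, fetches = set(), 0
        for t, block in enumerate(references):
            if block in resident:
                continue
            if len(resident) >= buffer_size:
                def next_use(b):
                    for i in range(t, len(references)):
                        if references[i] == b:
                            return i
                    return len(references)   # never used again
                resident.remove(max(resident, key=next_use))
            resident.add(block)
            fetches += 1
        return fetches

    print(count_fetches([1, 2, 3, 1, 2, 4, 1], buffer_size=2))   # -> 5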
Keywords: parallel database, parallel I/O,
pario-bib
Comment: Most interesting to me in this paper is
their discussion of the ``container model,'' in which they claim they allow
the processors to be driven by the I/O devices. See hirano:deadlock.
Abstract: We have developed a threaded parallel
data streaming approach using Globus to transfer multi-terabyte simulation
data from a remote supercomputer to the scientist's home
analysis/visualization cluster, as the simulation executes, with negligible
overhead. Data transfer experiments show that this concurrent data transfer
approach is more favorable compared with writing to local disk and then
transferring this data to be post-processed. The present approach is
conducive to using the grid to pipeline the simulation with post-processing
and visualization. We have applied this method to the Gyrokinetic Toroidal
Code (GTC), a 3-dimensional particle-in-cell code used to study
micro-turbulence in magnetic confinement fusion from first principles plasma
theory.
Keywords: grid, parallel data streams,
hydrodynamics, application, parallel I/O, pario-app, pario-bib
Comment: published on the web
Abstract: In this paper, we describe the design
and implementation of the Platform Independent Parallel Solver (PIPSolver)
package for the out-of-core (OOC) solution of complex dense linear systems.
Our approach is unique in that it allows essentially all of RAM to be filled
with the current portion of the matrix (slab) to be updated and factored,
thereby greatly improving the computation to I/O ratio over previous
approaches. This work could be viewed in part as an exercise in maximal code
reuse: By formulating the OOC LU factorization just right, we managed to
reuse essentially all of a very robust and efficient incore solver, leading
directly to a very robust and efficient OOC solver. Experiences and
performance are reported for the Cray T3D system.
Keywords: out-of-core algorithm, parallel I/O,
pario-bib
Abstract: Mission to Planet Earth (MTPE) is a
long-term NASA research mission to study the processes leading to global
climate change. The EOS Data and Information System (EOSDIS) is the component
within MTPE that will provide the Earth science community with easy,
affordable, and reliable access to Earth science data. EOSDIS is a
distributed system, with major facilities at eight Distributed Active Archive
Centers (DAACs) located throughout the United States. At the DAACs the
Science Data Processing Segment (SDPS) will receive, process, archive, and
manage all data. It is estimated that several hundred gigaflops of processing
power will be required to process and archive the several terabytes of new
data that will be generated and distributed daily. Thousands of science users
and perhaps several hundred thousand nonscience users will access the system.
Keywords: mass storage, I/O architecture, parallel
I/O, pario-bib
Abstract: Scientific applications are increasingly
being implemented on massively parallel supercomputers. Many of these
applications have intense I/O demands, as well as massive computational
requirements. This paper is essentially an annotated bibliography of papers
and other sources of information about scientific applications using parallel
I/O. It will be updated periodically.
Keywords: parallel I/O application, file access
patterns, dfk, pario-bib
Abstract: Many scientific applications that run on
today's multiprocessors, such as weather forecasting and seismic analysis,
are bottlenecked by their file-I/O needs. Even if the multiprocessor is
configured with sufficient I/O hardware, the file-system software often fails
to provide the available bandwidth to the application. Although libraries and
enhanced file-system interfaces can make a significant improvement, we
believe that fundamental changes are needed in the file-server software. We
propose a new technique, disk-directed I/O, to allow the disk servers to
determine the flow of data for maximum performance. Our simulations show that
tremendous performance gains are possible both for simple reads and writes
and for an out-of-core application. Indeed, our disk-directed I/O technique
provided consistent high performance that was largely independent of data
distribution, obtained up to 93% of peak disk bandwidth, and was as much as
18 times faster than the traditional technique.
Keywords: parallel I/O, multiprocessor file
system, file system caching, dfk, pario-bib
Comment: In jin:io-book, reprinted from
kotz:jdiskdir.
Abstract: Improvements in the processing speed of
multiprocessors are outpacing improvements in the speed of disk hardware.
Parallel disk I/O subsystems have been proposed as one way to close the gap
between processor and disk speeds. In a previous paper we showed that
prefetching and caching have the potential to deliver the performance
benefits of parallel file systems to parallel applications. In this paper we
describe experiments with practical prefetching policies that base decisions
only on on-line reference history, and that can be implemented efficiently.
We also test these policies across a range of architectural parameters.
Keywords: dfk, parallel file system, prefetching,
disk caching, parallel I/O, MIMD, pario-bib
Comment: Reformatted version of kotz:jpractical.
In jin:io-book.
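These practical prefetching policies decide, from on-line reference history
alone, whether to fetch blocks ahead of demand. As a rough illustration of the
kind of decision rule such policies build on (not one of the specific policies
evaluated here), the following C sketch prefetches the next block only after
observing a short sequential run; the names and threshold are illustrative.

    /* Sketch of a one-block-lookahead prefetch rule based only on on-line
     * reference history; illustrative only, not one of the policies studied
     * in the paper. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        int64_t last_block;     /* last block read on this file (init to -1) */
        int     sequential_run; /* consecutive sequential reads observed */
    } RefHistory;

    /* Called after a demand read of `block`; returns true if block+1 should
     * be prefetched into the file cache. */
    bool should_prefetch_next(RefHistory *h, int64_t block)
    {
        if (block == h->last_block + 1)
            h->sequential_run++;
        else
            h->sequential_run = 0;      /* pattern broken: stop prefetching */
        h->last_block = block;

        /* Prefetch only after a couple of sequential hits, so random access
         * patterns do not waste disk bandwidth on useless prefetches. */
        return h->sequential_run >= 2;
    }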
Abstract: Many scientific applications that run on
today's multiprocessors are bottlenecked by their file I/O needs. Even if the
multiprocessor is configured with sufficient I/O hardware, the file-system
software often fails to provide the available bandwidth to the application.
Although libraries and improved file-system interfaces can make a significant
improvement, we believe that fundamental changes are needed in the
file-server software. We propose a new technique, disk-directed I/O,
that flips the usual relationship between server and client to allow the
disks (actually, disk servers) to determine the flow of data for maximum
performance. Our simulations show that tremendous performance gains are
possible. Indeed, disk-directed I/O provided consistent high performance that
was largely independent of data distribution, and close to the maximum disk
bandwidth.
Keywords: parallel I/O, multiprocessor file
system, file system caching, pario-bib, dfk
Note: see the tech report kotz:diskdir-tr, which contains newer numbers than
those in the OSDI version, although the conclusions have not changed.
Comment: This paper also appeared in Bulletin of
the IEEE Technical Committee on Operating Systems and Application
Environments, Autumn 1994, pp. 29-42. Also available at
http://www.usenix.org/publications/library/proceedings/osdi/kotz.html.
Abstract: Many scientific applications that run on
today's multiprocessors are bottlenecked by their file I/O needs. Even if the
multiprocessor is configured with sufficient I/O hardware, the file-system
software often fails to provide the available bandwidth to the application.
Although libraries and improved file-system interfaces can make a significant
improvement, we believe that fundamental changes are needed in the
file-server software. We propose a new technique, disk-directed I/O,
that flips the usual relationship between server and client to allow the
disks (actually, disk servers) to determine the flow of data for maximum
performance. Our simulations show that tremendous performance gains are
possible. Indeed, disk-directed I/O provided consistent high performance that
was largely independent of data distribution, and close to the maximum disk
bandwidth.
Keywords: parallel I/O, multiprocessor file
system, file system caching, dfk, pario-bib
Comment: Short version appeared in OSDI'94. Please
note that the revised tech report contains newer numbers than those in the
OSDI version, although the conclusions have not changed.
Abstract: Many scientific applications that run on
today's multiprocessors are bottlenecked by their file I/O needs. Even if the
multiprocessor is configured with sufficient I/O hardware, the file-system
software often fails to provide the available bandwidth to the application.
Although libraries and improved file-system interfaces can make a significant
improvement, we believe that fundamental changes are needed in the
file-server software. We propose a new technique, disk-directed I/O,
that flips the usual relationship between server and client to allow the
disks (actually, disk servers) to determine the flow of data for maximum
performance. Our simulations show that tremendous performance gains are
possible. Indeed, disk-directed I/O provided consistent high performance that
was largely independent of data distribution, and close to the maximum disk
bandwidth.
Keywords: parallel I/O, multiprocessor file
system, file system caching, pario-bib, dfk
Note: see the tech report kotz:diskdir-tr, which contains newer numbers than
those in the OSDI version, although the conclusions have not changed.
Comment: Same as kotz:diskdir.
Abstract: We sketch the reasons for the I/O
bottleneck in parallel and distributed systems, pointing out that it can be
viewed as a special case of a general bottleneck that arises at all levels of
the memory hierarchy. We argue that because of its severity, the I/O
bottleneck deserves systematic attention at all levels of system design. We
then present a survey of the issues raised by the I/O bottleneck in six key
areas of parallel and distributed systems: applications, algorithms,
languages and compilers, run-time libraries, operating systems, and
architecture.
Keywords: survey, parallel I/O, pario-bib, dfk
Abstract: As parallel computers are increasingly
used to run scientific applications with large data sets, and as processor
speeds continue to increase, it becomes more important to provide fast,
effective parallel file systems for data storage and for temporary files. In
an earlier work we demonstrated that a technique we call disk-directed I/O
has the potential to provide consistent high performance for large,
collective, structured I/O requests. In this paper we expand on this
potential by demonstrating the ability of a disk-directed I/O system to read
irregular subsets of data from a file, and to filter and distribute incoming
data according to data-dependent functions.
Keywords: parallel I/O, multiprocessor file
systems, dfk, pario-bib
Abstract: As parallel computers are increasingly
used to run scientific applications with large data sets, and as processor
speeds continue to increase, it becomes more important to provide fast,
effective parallel file systems for data storage and for temporary files. In
an earlier work we demonstrated that a technique we call disk-directed I/O
has the potential to provide consistent high performance for large,
collective, structured I/O requests. In this paper we expand on this
potential by demonstrating the ability of a disk-directed I/O system to read
irregular subsets of data from a file, and to filter and distribute incoming
data according to data-dependent functions.
Keywords: parallel I/O, multiprocessor file
systems, dfk, pario-bib
Of course, computational processes sharing a node
with a file-system service may receive less CPU time, network bandwidth, and
memory bandwidth than they would on a computation-only node. In this paper we
begin to examine this issue experimentally. We found that high-performance
I/O does not necessarily require substantial CPU time, leaving plenty of time
for application computation. There were some complex file-system requests,
however, which left little CPU time available to the application. (The impact
on network and memory bandwidth still needs to be determined.) For
applications (or users) that cannot tolerate an occasional interruption, we
recommend that they continue to use only compute nodes. For tolerant
applications needing more cycles than those provided by the compute nodes, we
recommend that they take full advantage of both compute and I/O nodes
for computation, and that operating systems should make this possible.
Abstract: As parallel systems move into the
production scientific-computing world, the emphasis will be on cost-effective
solutions that provide high throughput for a mix of applications.
Cost-effective solutions demand that a system make effective use of all of
its resources. Many MIMD multiprocessors today, however, distinguish between
``compute'' and ``I/O'' nodes, the latter having attached disks and being
dedicated to running the file-system server. This static division of
responsibilities simplifies system management but does not necessarily lead
to the best performance in workloads that need a different balance of
computation and I/O.
Keywords: parallel I/O, multiprocessor file
system, dfk, pario-bib
Abstract: Most MIMD multiprocessors today are
configured with two distinct types of processor nodes: those that have disks
attached, which are dedicated to file I/O, and those that do not have disks
attached, which are used for running applications. Several architectural
trends have led some to propose configuring systems so that all processors
are used for application processing, even those with disks attached. We
examine this idea experimentally, focusing on the impact of remote I/O
requests on local computational processes. We found that in an efficient file
system the I/O processors can transfer data at near peak speeds with little
CPU overhead, leaving substantial CPU power for running applications. On the
other hand, we found that some complex file-system features could require
substantial CPU overhead. Thus, for a multiprocessor system to obtain good
I/O and computational performance on a mix of applications, the file system
(both operating system and libraries) must be prepared to adapt its
policies to changing conditions.
Keywords: parallel I/O, multiprocessor file
system, dfk, pario-bib
We propose that the traditional functionality of
parallel file systems be separated into two components: a fixed core that is
standard on all platforms, encapsulating only primitive abstractions and
interfaces, and a set of high-level libraries to provide a variety of
abstractions and application-programmer interfaces (APIs). We think of this
approach as the ``RISC'' of parallel file-system design. We present our
current and next-generation file systems as examples of this structure. Their
features, such as a three-dimensional file structure, strided read and write
interfaces, and I/O-node programs, are specifically designed with the
flexibility and performance necessary to support a wide range of
applications.
Abstract: Many scientific applications for
high-performance multiprocessors have tremendous I/O requirements. As a
result, the I/O system is often the limiting factor of application
performance. Several new parallel file systems have been developed in recent
years, each promising better performance for some class of parallel
applications. As we gain experience with parallel computing, and parallel
file systems in particular, it becomes increasingly clear that a single
solution does not suit all applications. For example, it appears to be
impossible to find a single appropriate interface, caching policy, file
structure, or disk management strategy. Furthermore, the proliferation of
file-system interfaces and abstractions makes application portability a
significant problem.
Keywords: parallel I/O, multiprocessor file
system, dfk, pario-bib
Comment: A position paper.
We propose that the
traditional functionality of parallel file systems be separated into two
components: a fixed core that is standard on all platforms, encapsulating
only primitive abstractions and interfaces, and a set of high-level libraries
to provide a variety of abstractions and application-programmer interfaces
(APIs). We present our current and next-generation file systems as
examples of this structure. Their features, such as a three-dimensional file
structure, strided read and write interfaces, and I/O-node programs, are
specifically designed with the flexibility and performance necessary to
support a wide range of applications.
Abstract: As we gain experience with parallel file
systems, it becomes increasingly clear that a single solution does not suit
all applications. For example, it appears to be impossible to find a single
appropriate interface, caching policy, file structure, or disk-management
strategy. Furthermore, the proliferation of file-system interfaces and
abstractions makes applications difficult to port.
Keywords: parallel I/O, multiprocessor file
system, dfk, pario-bib
Comment: Nearly identical to kotz:flexibility. The
only changes are the format, a shorter abstract, and updates to Section 7 and
the references.
Abstract: Increasingly, file systems for
multiprocessors are designed with parallel access to multiple disks, to keep
I/O from becoming a serious bottleneck for parallel applications. Although
file system software can transparently provide high-performance access to
parallel disks, a new file system interface is needed to facilitate parallel
access to a file from a parallel application. We describe the difficulties
faced when using the conventional (Unix-like) interface in parallel
applications, and then outline ways to extend the conventional interface to
provide convenient access to the file for parallel programs, while retaining
the traditional interface for programs that have no need for explicitly
parallel file access. Our interface includes a single naming scheme, a
multiopen operation, local and global file pointers, mapped file pointers,
logical records, multifiles, and logical coercion for backward
compatibility.
Keywords: dfk, parallel I/O, multiprocessor file
system, file system interface, pario-bib
Comment: See also lake:pario for implementation of
some of the ideas.
Abstract: Increasingly, file systems for
multiprocessors are designed with parallel access to multiple disks, to keep
I/O from becoming a serious bottleneck for parallel applications. Although
file system software can transparently provide high-performance access to
parallel disks, a new file system interface is needed to facilitate parallel
access to a file from a parallel application. We describe the difficulties
faced when using the conventional (Unix-like) interface in parallel
applications, and then outline ways to extend the conventional interface to
provide convenient access to the file for parallel programs, while retaining
the traditional interface for programs that have no need for explicitly
parallel file access. Our interface includes a single naming scheme, a
multiopen operation, local and global file pointers, mapped file pointers,
logical records, multifiles, and logical coercion for backward
compatibility.
Keywords: dfk, parallel I/O, multiprocessor file
system, file system interface, pario-bib
Comment: See also lake:pario for implementation of
some of the ideas.
Keywords: dfk, parallel I/O, multiprocessor file
system, file system interface, pario-bib
Comment: Short paper (2 pages).
Abstract: In other papers I propose the idea of
disk-directed I/O for multiprocessor file systems. Those papers focus on the
performance advantages and capabilities of disk-directed I/O, but say little
about the application-programmer's interface or about the interface between
the compute processors and I/O processors. In this short note I discuss the
requirements for these interfaces, and look at many existing interfaces for
parallel file systems. I conclude that many of the existing interfaces could
be adapted for use in a disk-directed I/O system.
Keywords: disk-directed I/O, parallel I/O,
multiprocessor filesystem interfaces, pario-bib, dfk
Comment: See also kotz:jdiskdir, kotz:expand, and
kotz:lu.
Keywords: parallel I/O, multiprocessor file
system, dfk, pario-bib
Comment: A bibliography of many references on
parallel I/O and multiprocessor file-systems issues. As of the fifth edition,
it is available on the WWW in HTML format.
Abstract: Many scientific applications that run on
today's multiprocessors, such as weather forecasting and seismic analysis,
are bottlenecked by their file-I/O needs. Even if the multiprocessor is
configured with sufficient I/O hardware, the file-system software often fails
to provide the available bandwidth to the application. Although libraries and
enhanced file-system interfaces can make a significant improvement, we
believe that fundamental changes are needed in the file-server software. We
propose a new technique, disk-directed I/O, to allow the disk servers to
determine the flow of data for maximum performance. Our simulations show that
tremendous performance gains are possible both for simple reads and writes
and for an out-of-core application. Indeed, our disk-directed I/O technique
provided consistent high performance that was largely independent of data
distribution, obtained up to 93% of peak disk bandwidth, and was as much as
18 times faster than the traditional technique.
Keywords: parallel I/O, multiprocessor file
system, file system caching, dfk, pario-bib
Comment: This paper is a substantial revision of
the diskdir-tr version: all of the experiments have been re-done, using a
better-tuned version of the file systems (see kotz:tuning), and adding
two-phase I/O to all comparisons. It also incorporates some of the material
from kotz:expand and kotz:int-ddio. Also available at
http://www.acm.org/pubs/citations/journals/tocs/1997-15-1/p41-kotz/.
Abstract: Improvements in the processing speed of
multiprocessors are outpacing improvements in the speed of disk hardware.
Parallel disk I/O subsystems have been proposed as one way to close the gap
between processor and disk speeds. In a previous paper we showed that
prefetching and caching have the potential to deliver the performance
benefits of parallel file systems to parallel applications. In this paper we
describe experiments with practical prefetching policies that base decisions
only on on-line reference history, and that can be implemented efficiently.
We also test these policies across a range of architectural parameters.
Keywords: dfk, parallel file system, prefetching,
disk caching, parallel I/O, MIMD, pario-bib
Comment: See also kotz:jwriteback, kotz:fsint2,
cormen:integrate.
Keywords: parallel file system, file access
pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk
Abstract: Improvements in the processing speed of
multiprocessors are outpacing improvements in the speed of disk hardware.
Parallel disk I/O subsystems have been proposed as one way to close the gap
between processor and disk speeds. Such parallel disk systems require
parallel file system software to avoid performance-limiting bottlenecks. We
discuss cache management techniques that can be used in a parallel file
system implementation for multiprocessors with scientific workloads. We
examine several writeback policies, and give results of experiments that test
their performance.
Keywords: dfk, parallel file system, disk caching,
parallel I/O, MIMD, pario-bib
Comment: See kotz:jpractical, kotz:fsint2,
cormen:integrate.
Abstract: New file systems are critical to obtain
good I/O performance on large multiprocessors. Several researchers have
suggested the use of collective file-system operations, in which all
processes in an application cooperate in each I/O request. Others have
suggested that the traditional low-level interface (read, write, seek)
be augmented with various higher-level requests (e.g., read matrix).
Collective, high-level requests permit a technique called disk-directed
I/O to significantly improve performance over traditional file systems and
interfaces, at least on simple I/O benchmarks. In this paper, we present the
results of experiments with an ``out-of-core'' LU-decomposition program.
Although its collective interface was awkward in some places, and forced
additional synchronization, disk-directed I/O was able to obtain much better
overall performance than the traditional system.
Keywords: parallel I/O, numerical analysis, dfk,
pario-bib
Abstract: New file systems are critical to obtain
good I/O performance on large multiprocessors. Several researchers have
suggested the use of collective file-system operations, in which all
processes in an application cooperate in each I/O request. Others have
suggested that the traditional low-level interface (read, write, seek)
be augmented with various higher-level requests (e.g., read matrix),
allowing the programmer to express a complex transfer in a single (perhaps
collective) request. Collective, high-level requests permit techniques like
two-phase I/O and disk-directed I/O to significantly improve
performance over traditional file systems and interfaces. Neither of these
techniques have been tested on anything other than simple benchmarks that
read or write matrices. Many applications, however, intersperse computation
and I/O to work with data sets that cannot fit in main memory. In this paper,
we present the results of experiments with an ``out-of-core''
LU-decomposition program, comparing a traditional interface and file system
with a system that has a high-level, collective interface and disk-directed
I/O. We found that a collective interface was awkward in some places, and
forced additional synchronization. Nonetheless, disk-directed I/O was able to
obtain much better performance than the traditional system.
Keywords: parallel I/O, numerical analysis, dfk,
pario-bib
Abstract: The computational performance of
multiprocessors continues to improve by leaps and bounds, fueled in part by
rapid improvements in processor and interconnection technology. I/O
performance thus becomes ever more critical, to avoid becoming the bottleneck
of system performance. In this paper we provide an introduction to I/O
architectural issues in multiprocessors, with a focus on disk subsystems.
While we discuss examples from actual architectures and provide pointers to
interesting research in the literature, we do not attempt to provide a
comprehensive survey. We concentrate on a study of the architectural design
issues, and the effects of different design alternatives.
Keywords: parallel I/O, multiprocessor file
system, pario-bib, dfk
Comment: Invited paper. Part of a whole book on
parallel I/O; see iopads-book.
Abstract: Parallel disk subsystems have been
proposed as one way to close the gap between processor and disk speeds. In a
previous paper we showed that prefetching and caching have the potential to
deliver the performance benefits of parallel file systems to parallel
applications. In this paper we describe experiments with practical
prefetching policies, and show that prefetching can be implemented
efficiently even for the more complex parallel file access patterns. We test
these policies across a range of architectural parameters.
Keywords: dfk, parallel file system, prefetching,
disk caching, parallel I/O, MIMD, OS93W extra, OS92W, pario-bib
Comment: Short form of primary thesis results. See
kotz:jwriteback, kotz:fsint2, cormen:integrate.
Experiments have
been conducted with an interleaved file system testbed on the Butterfly Plus
multiprocessor. Results of these experiments suggest that 1) the hit ratio,
the accepted measure in traditional caching studies, may not be an adequate
measure of performance when the workload consists of parallel computations
and parallel file access patterns, 2) caching with prefetching can
significantly improve the hit ratio and the average time to perform an I/O
operation, and 3) an improvement in overall execution time has been observed
in most cases. In spite of these gains, prefetching sometimes results in
increased execution times (a negative result, given the optimistic nature of
the study). We explore why it is not trivial to translate savings on
individual I/O requests into consistently better overall performance and
identify the key problems that need to be addressed in order to improve the
potential of prefetching techniques in this environment.
Abstract: The problem of providing file I/O to
parallel programs has been largely neglected in the development of
multiprocessor systems. There are two essential elements of any file system
design intended for a highly parallel environment: parallel I/O and effective
caching schemes. This paper concentrates on the second aspect of file system
design and specifically, on the question of whether prefetching blocks of the
file into the block cache can effectively reduce overall execution time of a
parallel computation, even under favorable assumptions.
Keywords: dfk, parallel file system, prefetching,
MIMD, disk caching, parallel I/O, pario-bib
This dissertation studies some of the file system issues
needed to get high performance from parallel disk systems, since parallel
hardware alone cannot guarantee good performance. The target systems are
large MIMD multiprocessors used for scientific applications, with large files
spread over multiple disks attached in parallel. The focus is on automatic
caching and prefetching techniques. We show that caching and prefetching can
transparently provide the power of parallel disk hardware to both sequential
and parallel applications using a conventional file system interface. We also
propose a new file system interface (compatible with the conventional
interface) that could make it easier to use parallel disks effectively.
Our methodology is a mixture of implementation and simulation, using a
software testbed that we built to run on a BBN GP1000 multiprocessor. The
testbed simulates the disks and fully implements the caching and prefetching
policies. Using a synthetic workload as input, we use the testbed in an
extensive set of experiments. The results show that prefetching and caching
improved the performance of parallel file systems, often dramatically.
Abstract: The increasing speed of the most
powerful computers, especially multiprocessors, makes it difficult to provide
sufficient I/O bandwidth to keep them running at full speed for the largest
problems. Trends show that the difference in the speed of disk hardware and
the speed of processors is increasing, with I/O severely limiting the
performance of otherwise fast machines. This widening access-time gap is
known as the ``I/O bottleneck crisis.'' One solution to the crisis, suggested
by many researchers, is to use many disks in parallel to increase the overall
bandwidth.
Keywords: dfk, parallel file system, prefetching,
MIMD, disk caching, parallel I/O, pario-bib
Comment: Published as kotz:prefetch,
kotz:jwriteback, kotz:jpractical, kotz:fsint2.
Keywords: parallel I/O, multiprocessor file
system, performance, survey, dfk, pario-bib
Comment: A brief note on the reported performance
of existing file systems (Intel CFS, nCUBE, CM-2, CM-5, and Cray). Many have
disappointingly low absolute throughput, in MB/s.
Abstract: STARFISH is a parallel file-system
simulator we built for our research into the concept of disk-directed I/O. In
this report, we detail steps taken to tune the file systems supported by
STARFISH, which include a traditional parallel file system (with caching) and
a disk-directed I/O system. In particular, we added support for two-phase
I/O, smarter disk scheduling, an increased maximum number of outstanding
requests that a compute processor may make to each disk, and gather/scatter
block transfer. We also present results of the experiments driving the tuning
effort.
Keywords: parallel I/O, multiprocessor file
system, dfk, pario-bib
Comment: Reports on some new changes to the
STARFISH simulator that implements traditional caching and disk-directed I/O.
This is meant mainly as a companion to kotz:jdiskdir. See also kotz:jdiskdir,
kotz:diskdir, kotz:expand.
Most successful
systems are based on a solid understanding of the characteristics of the
expected workload, but until now there have been no comprehensive workload
characterizations of multiprocessor file systems. We began the CHARISMA
project in an attempt to fill that gap. We instrumented the common node
library on the iPSC/860 at NASA Ames to record all file-related activity over
a two-week period. Our instrumentation is different from previous efforts in
that it collects information about every read and write request and about the
mix of jobs running in the machine (rather than from selected
applications). The trace analysis in this paper leads to many
recommendations for designers of multiprocessor file systems. First, the file
system should support simultaneous access to many different files by many
jobs. Second, it should expect to see many small requests, predominantly
sequential and regular access patterns (although of a different form than in
uniprocessors), little or no concurrent file-sharing between jobs,
significant byte- and block-sharing between processes within jobs, and strong
interprocess locality. Third, our trace-driven simulations showed that these
characteristics led to great success in caching, both at the compute nodes
and at the I/O nodes. Finally, we recommend supporting strided I/O requests
in the file-system interface, to reduce overhead and allow more performance
optimization by the file system.
Abstract: Multiprocessors have permitted
astounding increases in computational performance, but many cannot meet the
intense I/O requirements of some scientific applications. An important
component of any solution to this I/O bottleneck is a parallel file system
that can provide high-bandwidth access to tremendous amounts of data in
parallel to hundreds or thousands of processors.
Keywords: parallel file system, file access
pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk
Comment: Also at
http://www.acm.org/pubs/citations/proceedings/supercomputing/198354/p640-kotz
and http://computer.org/conferen/sc94/kotz.html
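The recommendation to support strided requests in the file-system interface is
easiest to picture with a concrete example. The sketch below uses MPI-IO, a
later standard interface and not anything studied in CHARISMA, to express a
strided, collective read in a single request; the file name and sizes are
invented for the illustration.

    /* Each process reads every nprocs-th block of a shared file with one
     * collective call, instead of issuing many small reads.  Illustrative
     * only: "data.bin" and the sizes are made up. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int block = 1024;      /* contiguous bytes per process per stripe */
        const int nblocks = 128;     /* blocks each process reads */

        MPI_Datatype filetype;       /* describes the strided pattern once */
        MPI_Type_vector(nblocks, block, block * nprocs, MPI_BYTE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        /* The view shifts each process to its own first block. */
        MPI_File_set_view(fh, (MPI_Offset)rank * block, MPI_BYTE, filetype,
                          "native", MPI_INFO_NULL);

        char *buf = malloc((size_t)nblocks * block);
        MPI_Status status;
        MPI_File_read_all(fh, buf, nblocks * block, MPI_BYTE, &status);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }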
Most successful
systems are based on a solid understanding of the characteristics of the
expected workload, but until now there have been no comprehensive workload
characterizations of multiprocessor file systems. We began the CHARISMA
project in an attempt to fill that gap. We instrumented the common node
library on the iPSC/860 at NASA Ames to record all file-related activity over
a two-week period. Our instrumentation is different from previous efforts in
that it collects information about every read and write request and about the
mix of jobs running in the machine (rather than from selected
applications). The trace analysis in this paper leads to many
recommendations for designers of multiprocessor file systems. First, the file
system should support simultaneous access to many different files by many
jobs. Second, it should expect to see many small requests, predominantly
sequential and regular access patterns (although of a different form than in
uniprocessors), little or no concurrent file-sharing between jobs,
significant byte- and block-sharing between processes within jobs, and strong
interprocess locality. Third, our trace-driven simulations showed that these
characteristics led to great success in caching, both at the compute nodes
and at the I/O nodes. Finally, we recommend supporting strided I/O requests
in the file-system interface, to reduce overhead and allow more performance
optimization by the file system.
Abstract: Multiprocessors have permitted
astounding increases in computational performance, but many cannot meet the
intense I/O requirements of some scientific applications. An important
component of any solution to this I/O bottleneck is a parallel file system
that can provide high-bandwidth access to tremendous amounts of data in
parallel to hundreds or thousands of processors.
Keywords: parallel file system, file access
pattern, multiprocessor file system workload, parallel I/O, pario-bib, dfk
Abstract: Improvements in the processing speed of
multiprocessors are outpacing improvements in the speed of disk hardware.
Parallel disk I/O subsystems have been proposed as one way to close the gap
between processor and disk speeds. Such parallel disk systems require
parallel file system software to avoid performance-limiting bottlenecks. We
discuss cache management techniques that can be used in a parallel file
system implementation. We examine several writeback policies, and give
results of experiments that test their performance.
Keywords: dfk, parallel file system, disk caching,
parallel I/O, MIMD, pario-bib
Comment: See also kotz:jpractical, kotz:fsint2,
cormen:integrate.
Abstract: The most frequently used part of MPI-2
is MPI I/O. Due to the complexity of parallel programming in general, and of
handling parallel I/O in particular, there is a need for tools that support
the application development process. There are many situations where
incorrect usage of MPI by the application programmer can be automatically
detected. In this paper we describe the MARMOT tool that uncovers some of
these errors and we also analyze to what extent it is possible to do so for
MPI I/O.
Keywords: MPI I/O, error detection, performance
analysis, MARMOT, pario-bib
Keywords: memory-mapped file, file system,
parallel I/O, pario-bib
We show that on several Unix workstation
platforms the performance of Unix applications using the Alloc Stream
Facility can be substantially better than when the applications use the
original I/O facilities.
Abstract: This paper describes the design and
implementation of a new application level I/O facility, called the Alloc
Stream Facility. The Alloc Stream Facility has several key advantages. First,
performance is substantially improved as a result of a) the structure of the
facility that allows it to take advantage of system specific features like
mapped files, and b) a reduction in data copying and the number of I/O system
calls. Second, the facility is designed for multi-threaded applications
running on multiprocessors and allows for a high degree of concurrency.
Finally, the facility can support a variety of I/O interfaces, including
stdio, emulated Unix I/O, ASI, and C++ streams, in a way that allows
applications to freely intermix calls to the different interfaces, resulting
in improved code reusability.
Keywords: memory-mapped file, file system,
parallel I/O, pario-bib
Comment: See also krieger:mapped. ``This is an
extended version of the paper with the same title in the March, 1994 edition
of IEEE Computer.'' A 3-level interface structure: interface, backplane, and
stream-specific modules. Different interfaces available: unix, stdio, ASI
(theirs), C++. Common backplane. Stream-specific implementations that export
operations like salloc and sfree, which return pointers to data buffers. ASI
exports that interface to the user, for maximum efficiency. Performance is
best when using mapped files as underlying implementation. Many stdio or unix
apps are faster only after relinking. ASI is even faster. In addition to
better performance, also get multithreading support, multiple interfaces, and
extensibility.
Abstract: The Hurricane File System (HFS)
is a new file system being developed for large-scale shared memory
multiprocessors with distributed disks. The main goal of this file system is
scalability; that is, the file system is designed to handle demands that are
expected to grow linearly with the number of processors in the system. To
achieve this goal, HFS is designed using a new structuring technique called
Hierarchical Clustering. HFS is also designed to be flexible in supporting a
variety of policies for managing file data and for managing file system
state. This flexibility is necessary to support in a scalable fashion the
diverse workloads we expect for a multiprocessor file system.
Keywords: multiprocessor file system, parallel
I/O, operating system, shared memory, pario-bib
Comment: This paper is now out of date; see
krieger:thesis. Designed for scalability on the hierarchical clustering model
(see unrau:cluster), the Hurricane File System for NUMA shared-memory MIMD
machines. Each cluster has its own full file system, which communicates with
those in other clusters. Pieces are name server, open-file server, and
block-file server. On first access, the file is mapped into the application
space. VM system calls BFS to arrange transfers. Open questions: policies for
file state management, block distribution, caching, and prefetching.
Object-oriented approach used to allow for flexibility and extendability.
Local disk file systems are log-structured.
We have implemented HFS as part of the Hurricane operating system
running on the Hector shared memory multiprocessor. We demonstrate that the
flexibility of HFS comes with little processing or I/O overhead. We also show
that for a number of file access patterns HFS is able to deliver to the
applications the full I/O bandwidth of the disks on our system.
Abstract: The Hurricane File System (HFS) is
designed for (potentially large-scale) shared memory multiprocessors. Its
architecture is based on the principle that, in order to maximize performance
for applications with diverse requirements, a file system must support a wide
variety of file structures, file system policies and I/O interfaces. Files in
HFS are implemented using simple building blocks composed in potentially
complex ways. This approach yields great flexibility, allowing an application
to customize the structure and policies of a file to exactly meet its
requirements. For example, a file's structure can be optimized for concurrent
random-access write-only operations by ten processes. Similarly, the
prefetching, locking, and file cache management policies can all be chosen to
match an application's access pattern. In contrast, most existing parallel
file systems support a single file structure and a small set of policies.
Keywords: parallel I/O, parallel file system,
object-oriented, pario-bib
Comment: A published form of krieger:hfs and the
thesis krieger:thesis. Their main point is that the file system is
constructed from building-block objects. When you create a file you choose a
few building blocks, for example, a replication block that mirrors the file,
and some distribution blocks that distribute each replica across a set of
disks. When you open the file you plug in some more building blocks, e.g., to
do prefetching or to provide the kind of interface that you want to use. They
point out that this flexibility is critical to be able to get good
performance, because different file-access patterns need different structures
and policies. They found that mapped files minimize copying costs and improve
performance. They were able to obtain full disk bandwidth. Great paper.
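The building-block composition described above can be sketched in a few lines
of C; the structure and names below are invented for illustration and do not
reflect HFS's actual classes.

    /* A file object assembled from small blocks that share one interface and
     * delegate downward (e.g., prefetching stacked on striping on disks).
     * Illustrative sketch only, not the HFS implementation. */
    #include <stddef.h>
    #include <sys/types.h>

    typedef struct FileBlock FileBlock;
    struct FileBlock {
        ssize_t (*read)(FileBlock *self, void *buf, size_t len, off_t off);
        FileBlock *below;   /* next building block in the composition */
        void *state;        /* block-specific policy state */
    };

    /* A trivial pass-through block; a real block (replication, striping,
     * prefetching, locking) would apply its policy before/after delegating. */
    ssize_t passthrough_read(FileBlock *self, void *buf, size_t len, off_t off)
    {
        return self->below->read(self->below, buf, len, off);
    }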
Abstract: The Hurricane File System (HFS) is
designed for (potentially large-scale) shared-memory multiprocessors. Its
architecture is based on the principle that, in order to maximize performance
for applications with diverse requirements, a file system must support a wide
variety of file structures, file system policies, and I/O interfaces. Files
in HFS are implemented using simple building blocks composed in potentially
complex ways. This approach yields great flexibility, allowing an application
to customize the structure and policies of a file to exactly meet its
requirements. As an extreme example, HFS allows a file's structure to be
optimized for concurrent random-access write-only operations by 10 threads,
something no other file system can do. Similarly, the prefetching, locking,
and file cache management policies can all be chosen to match an
application's access pattern. In contrast, most parallel file systems support
a single file structure and a small set of policies. We have implemented HFS
as part of the Hurricane operating system running on the Hector shared-memory
multiprocessor. We demonstrate that the flexibility of HFS comes with little
processing or I/O overhead. We also show that for a number of file access
patterns, HFS is able to deliver to the applications the full I/O bandwidth
of the disks on our system.
Keywords: parallel I/O, parallel file system,
object-oriented, pario-bib
We have implemented large portions of HFS as part of the Hurricane operating
system running on the Hector shared-memory multiprocessor. We demonstrate
that the flexibility of HFS comes with little processing or I/O overhead.
Also, we show that HFS is able to deliver the full I/O bandwidth of the disks
on our system to the applications.
Abstract: The Hurricane File System (HFS) is
designed for large-scale, shared-memory multiprocessors. Its architecture is
based on the principle that a file system must support a wide variety of file
structures, file system policies and I/O interfaces to maximize performance
for a wide variety of applications. HFS uses a novel, object-oriented
building-block approach to provide the flexibility needed to support this
variety of file structures, policies, and I/O interfaces. File structures can
be defined in HFS that optimize for sequential or random access, read-only,
write-only or read/write access, sparse or dense data, large or small file
sizes, and different degrees of application concurrency. Policies that can be
defined on a per-file or per-open instance basis include locking policies,
prefetching policies, compression/decompression policies and file cache
management policies. In contrast, most existing file systems have been
designed to support a single file structure and a small set of policies.
Keywords: parallel I/O, multiprocessor file system,
shared memory, memory-mapped I/O, pario-bib
Comment: Excellent work. HFS uses an
object-oriented building-block approach to provide flexible, scalable high
performance. Indeed, HFS appears to be one of the most flexible parallel file
systems available, allowing users to independently control (or redefine)
policies for prefetching, caching, redundancy and fault tolerance, and
declustering.
Keywords: parallel I/O, parallel file system,
performance measurement, pario-bib
Comment: Short measurements of CM-2 Datavault.
Faster if you access through Paris. Can get nearly full 32 MB/s bandwidth.
Problem in its ability to use multiple CMIO busses.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Essentially a (short) combination of
krystynak:datavault and nitzberg:cfs.
Keywords: parallel file system interface,
pario-bib
Comment: Experience making libc reentrant, adding
semaphores, etc., on a Convex. Some problems with I/O. Added semaphores and
private memory to make libc calls reentrant, i.e., callable in parallel by
multiple threads.
Keywords: parallel I/O, pario-bib
Comment: A visualization tool, now long gone, for
display of CHARISMA trace files. See nieuwejaar:workload for details of
CHARISMA.
Abstract: The paper summarizes our experiences
using the Panda parallel I/O library with the H3expresso numerical relativity
code on the Cornell SP2. Two performance issues are described: providing
efficient off-loading of output data, and satisfying users' desire to
dedicate fewer nodes to I/O. We explore the tradeoffs between potential
solutions, and present performance results for our approaches. We also show
that Panda's high level interface, which allows the user to request input or
output of a set of arrays with a single command, is a good match for
H3expresso's needs.
Keywords: application experience, parallel
input/output, parallel I/O, performance issues, multiprocessor file system
interface, pario-bib
Abstract: Large simulations which run for hundreds
of hours on parallel computers often periodically generate snapshots of
states, which are later post-processed to visualize the simulated physical
phenomenon. For many applications, fast I/O during post-processing, which is
dependent on an efficient organization of data on disk, is as important as
minimizing computation-time I/O. In this paper we propose optimizations to
support efficient parallel I/O for scientific simulations and subsequent
visualizations. We present an ordering mechanism to linearize data on disk, a
performance model to help to choose a proper stripe unit size, and a
scheduling algorithm to minimize communication contention. Our experiments on
an IBM SP show that the combination of these strategies provides a 20-25%
performance boost.
Keywords: scientific computing, simulation,
parallel I/O, pario-bib
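A minimal sketch of the round-robin striping arithmetic behind the
stripe-unit-size choice discussed above: given a byte offset and a stripe
unit, find which I/O server holds the byte and where. The names are
hypothetical, and this is the simplest round-robin layout rather than
necessarily the one used in the paper.

    /* Map a linear file offset to (server, local offset) under round-robin
     * striping with a fixed stripe unit.  Illustrative names and layout. */
    #include <stdint.h>

    typedef struct {
        int      server;        /* I/O server (or disk) holding the byte */
        uint64_t local_offset;  /* offset within that server's portion */
    } StripeLoc;

    StripeLoc locate(uint64_t offset, uint64_t stripe_unit, int nservers)
    {
        uint64_t stripe_index = offset / stripe_unit;    /* which stripe unit */
        StripeLoc loc;
        loc.server = (int)(stripe_index % nservers);     /* round-robin server */
        loc.local_offset = (stripe_index / (uint64_t)nservers) * stripe_unit
                         + offset % stripe_unit;         /* position on server */
        return loc;
    }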
Keywords: scientific applications, query-based
interface, parallel I/O, pario-bib
Comment: They describe an architecture for
accessing data in scientific datasets by performing range queries (a
multidimensional bounding box) over the data. This type of access mechanism
is useful for applications like satellite imaging.
Keywords: parallel I/O, parallel architecture,
multiprocessor file system, pario-bib
Comment: They measure the performance of the CM-5
Scalable File System using synthetic benchmarks. They compare CM-Fortran with
CMMD. The hardware-dependent (``physical'') modes were much faster than the
generic-format modes, which have to reorder data between the processor
distribution and the disk distribution. The network turned out to be a
bottleneck for the performance when reordering was needed. They conclude that
more user control over the I/O would be very helpful.
Keywords: parallel I/O, sorting, pario-bib
Comment: Examines external sorting techniques such
as merge sort, tag sort, multi-pass distribution sort, and one-pass
distribution sort. The model is one where I/O complexity is included,
assuming a linear seek time distribution and a cost of 1/2 rotation for each
seek. Parallel I/O or computing are not considered until the distribution
sorts. Architectural model on page 58.
Abstract: Recent studies have demonstrated that a
significant number of I/O operations are performed by several different
classes of parallel applications. Appropriate I/O management strategies are
required, however, for harnessing the power of parallel I/O. This paper focuses
on two I/O management issues that affect system performance in
multiprogrammed parallel environments. Characterization of I/O behavior of
parallel applications in terms of four different models is discussed first,
followed by an investigation of the performance of a number of different data
distribution strategies. Using computer simulations, this research shows that
I/O characteristics of applications and data distribution have an important
effect on system performance. Applications that can simultaneously do
computation and I/O, plus strategies that can incorporate centralized I/O
management are found to be beneficial for a multiprogrammed parallel
environment.
Keywords: parallel I/O, pario-bib
Comment: See majumdar:management.
Abstract: Recent studies have demonstrated that
significant I/O is performed by a number of parallel applications. In
addition to running these applications on multiple processors, the
parallelization of I/O operations and the use of multiple disk drives are
required for achieving high system performance. This research is concerned
with the effective management of parallel I/O by using appropriate I/O
scheduling strategies. Based on a simulation model the performance of a
number of scheduling policies is investigated. I/O characteristics of jobs,
such as the total outstanding I/O demand, are observed to be useful in
devising effective scheduling strategies.
Keywords: parallel I/O, scheduling, pario-bib
Keywords: parallel I/O, MIMD, multiprocessor file
system, pario-bib
Comment: They describe the I/O system for the
Myrias SPS-3 parallel computer. The SPS is a no-remote-access (NORMA) machine
with a software shared memory abstraction. They provide a standard C/FORTRAN
I/O interface, with a few extensions. The user's parallel program is
considered a client, and an I/O processor (IOP) is the server. No striping
across IOPs, which makes it relatively simple for them to have the server
manage the shared file pointer. Their extensions allow atomic file-pointer
update (returning the actual position where the I/O occurred) and atomic access
to fixed- and variable-length records. They have three protocols, for
different transfer sizes; small using simple request/response; medium using
sliding window; and large using scatter/gather and special hardware double
buffering at the IOP. They use scatter/gather DMA, and page-table fiddling,
for messaging. Performance is 89-96% of hardware peak, limited by IOP's VME
backplane.
Keywords: parallel I/O, algorithms, pario-bib
Abstract: As the number of nodes in cluster
systems continues to grow, leveraging scalable algorithms in all aspects of
such systems becomes key to maintaining performance. While scalable
algorithms have been applied successfully in some areas of parallel I/O, many
operations are still performed in an uncoordinated manner. In this work we
consider, in three file system scenarios, the possibilities for applying
scalable algorithms to the many operations that make up the MPI-IO interface.
From this evaluation we extract a set of file system characteristics that aid
in developing scalable MPI-IO implementations.
Keywords: scalability analysis, MPI-IO, pario-bib
Keywords: pvfs2, parallel file system, pario-bib
Abstract: The paper describes a new
interconnection network for massively parallel systems, referred to as
star-connected cycles (SCC). The SCC graph presents an I/O-bounded structure
that results in several advantages over variable degree graphs like the star
and the hypercube. The description of the SCC graph includes issues such as
labelling of nodes, degree, diameter and symmetry. The paper also presents an
optimal routeing algorithm for the SCC and efficient broadcasting algorithms
with O(n) running time, with n being the dimensionality of the graph. A
comparison with the cube-connected cycles (CCC) and other interconnection
networks is included, indicating that, for even n, an n-SCC and a CCC of
similar sizes have about the same diameter. In addition, it is shown that
one-port broadcasting in an n-SCC graph can be accomplished with a running
time better than or equal to that required by an n-star containing (n-1)
times fewer nodes.
Keywords: parallel I/O, parallel computer
architecture, pario-bib
Keywords: srb, performance-related optimization,
pario, pario-bib
Comment: SRB data transfer optimization on cluster
storage servers. If disk-bound, transfers from server to disks are broken up
so that protocol processing and disk transfer are pipelined. If network-bound,
transfers are striped from multiple clients to multiple servers. No
mention of remote execution.
Keywords: distributed file system, multiprocessor
file system, pario-bib
Comment: See also broom:acacia, broom:impl,
mutisya:cache, and broom:cap. The Acacia file system has file access modes
that are much like those in Intel CFS and TMC CMMD. By default all processes
have their own file pointer, but they can switch to another mode either all
together or in row- or column-subsets. The other modes include a replicated
mode (where all read or write the same data), and a variety of shared modes,
with arbitrary, fixed, or unspecified ordering among processors, and with
fixed or variable-sized records. They also have a parallel-open operation,
support for logical records, control over the striping width (number of
disks) and height (block size), and control over redundancy. A prototype
is running.
Keywords: parallel I/O, disk array, RAID,
pario-bib
Comment: An early paper, perhaps the earliest,
that describes the techniques that later became RAID. Lawlor notes how to use
parity to recover data lost due to disk crash, as in RAID3, addresses the
read-before-write problem by caching the old data block as well as the new
data block, and shows how two-dimensional parity can protect against two or
more failures.
Keywords: RAID, disk array, reliability, parallel
I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of lee:jparity.
Keywords: parallel I/O, distributed file system,
declustering, reliability, pario-bib
Comment: Part of jin:io-book; reformatted version
of lee:petal.
Keywords: parallel I/O, file prefetching,
pario-bib
Keywords: parallel I/O, pario-bib
Abstract: We address the problem of assigning
nonpartitioned files in a parallel I/O system where the file accesses exhibit
Poisson arrival rates and fixed service times. We present two new file
assignment algorithms based on open queuing networks which aim simultaneously
to balance the load across all disks and to minimize the variance of
the service time at each disk. We first present an off-line algorithm, Sort
Partition, which assigns to each disk files with similar access time. Next,
we show that, assuming that a perfectly balanced file assignment can be found
for a given set of files, Sort Partition will find the one with minimal mean
response time. We then present an on-line algorithm, Hybrid Partition, that
assigns groups of files with similar service times in successive intervals
while guaranteeing that the load imbalance at any point does not exceed a
certain threshold. We report on synthetic experiments which exhibit skew in
file accesses and sizes and we compare the performance of our new algorithms
with the vanilla greedy file allocation algorithm.
Keywords: parallel I/O, parallel file system,
pario-bib
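As a rough illustration of the Sort Partition idea described in the abstract
(not the authors' algorithm), the sketch below sorts files by service time and
assigns contiguous runs of similar files to disks while keeping each disk's
expected load near the average; the load model is deliberately simplified.

    /* Greedy sketch: sort files by service time so that each disk receives a
     * contiguous run of files with similar service times, moving to the next
     * disk once the current one has roughly its share of the total load. */
    #include <stdlib.h>

    typedef struct {
        double service_time;   /* fixed service time of one access */
        double arrival_rate;   /* Poisson arrival rate of accesses */
        int    disk;           /* assigned disk, filled in below */
    } File;

    static int by_service_time(const void *a, const void *b)
    {
        double d = ((const File *)a)->service_time
                 - ((const File *)b)->service_time;
        return (d > 0) - (d < 0);
    }

    void sort_partition(File *files, int nfiles, int ndisks)
    {
        double total = 0.0;
        for (int i = 0; i < nfiles; i++)
            total += files[i].arrival_rate * files[i].service_time;
        double target = total / ndisks;   /* expected load per disk */

        qsort(files, nfiles, sizeof(File), by_service_time);

        int disk = 0;
        double load = 0.0;
        for (int i = 0; i < nfiles; i++) {
            files[i].disk = disk;
            load += files[i].arrival_rate * files[i].service_time;
            if (load >= target && disk < ndisks - 1) {
                disk++;        /* this disk has its share; start the next */
                load = 0.0;
            }
        }
    }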
Keywords: parallel I/O, disk striping,
performance, pario-bib
Comment: Details of their prototype. Defines terms
like stripe unit. Explores ways to lay out parity. Does performance
simulations. Describes ops needed in device driver. Good to read if you plan
to implement a RAID. Results: small R+W, or high loads, don't care about
parity placement; in low load, there are different best cases for large R+W.
Best all-around is left-symmetric. See also lee:parity.
Keywords: RAID, reliability, parallel I/O, disk
striping, pario-bib
Comment: Journal version of lee:parity.
Abstract: In this paper we propose
user-controllable I/O operations and explore the effects of them with some
synthetic access patterns. The operations allow users to determine a file
structure matching the access patterns, control the layout and distribution
of data blocks on physical disks, and present various access patterns with a
minimum number of I/O operations. The operations do not use a file pointer to
access data as in typical file systems, which eliminates the overhead of
managing the offset of the file, making it easy to share data and reducing
the number of I/O operations.
Keywords: logical disks, parallel I/O, pario-bib
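The pointer-free access the abstract describes can be illustrated, by analogy
only, with POSIX pread(), which addresses data by explicit offset and
therefore needs no shared file-pointer state; the file name and offsets below
are made up, and this is not the paper's interface.

    /* Offset-addressed read: no seek, so concurrent processes need not
     * coordinate a shared file pointer.  Illustrative file name and offset. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        /* Each process computes its own offset and reads independently. */
        ssize_t n = pread(fd, buf, sizeof(buf), (off_t)4096 * 7);
        if (n < 0) { perror("pread"); close(fd); return 1; }

        printf("read %zd bytes\n", n);
        close(fd);
        return 0;
    }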
Keywords: parallel I/O, prefetching, disk array,
pario-bib
Keywords: RAID, disk array, reliability, parallel
I/O, pario-bib
Comment: Interesting comparison of several parity
placement schemes. Boils down to two basic choices, depending on whether read
performance or write performance is more important to you.
Keywords: parallel I/O, distributed file system,
declustering, reliability, pario-bib
Comment: They are trying to build a file server
that is easier to manage than most of today's distributed file systems,
because disks are cheap but management is expensive. They describe a
distributed file server that spreads blocks of all files across many disks
and many servers. They use chained declustering so that they can survive loss
of server or disk. They dynamically balance load. They dynamically
reconfigure when new virtual disks are created or new physical disks are
added. They've built it all and are now going to look at possible file
systems that can take advantage of the features of Petal.
Keywords: disk array, parallel I/O, RAID, analytic
model, pario-bib
Abstract: This paper presents
communication-efficient algorithms for the external data redistribution
problem. Deterministic lower bounds and upper bounds are presented for the
number of I/O operations, communication time and the memory requirements of
external redistribution. Our algorithms differ from most other algorithms
presented for out-of-core applications in that it is optimal (within a small
constant factor) not only in the number of I/O operations, but also in the
time taken for communication. A coarse-grained MIMD architecture with I/O
subsystems attached to each processor is assumed, but the results are
expected to be applicable over a wider variety of architectures.
Keywords: parallel I/O algorithm, out-of-core,
pario-bib
Comment: See shankar:transport for the underlying
communication primitives.
Abstract: The paper describes a parallel file
object environment to support distributed array storage on shared-nothing
distributed computing environments. Our environment enables programmers to
extend the concept of array distribution from memory levels to file levels.
It allows parallel I/O according to the distribution of objects in an
application. When objects are read and/or written by multiple applications
using different distributions, we present a novel scheme that helps
programmers select the data distribution pattern that minimizes the amount of
remote data movement when storing array objects on distributed file systems.
Keywords: parallel I/O, object oriented,
distributed memory, pario-bib
Abstract: This paper presents the design of UPIO,
a software for user-controllable parallel input and output. UPIO is designed
to maximize I/O performance for scientific applications on MIMD
multicomputers. The most important features of UPIO are that it supports a
domain-specific file model and a variety of application interfaces to present
numerous access patterns. UPIO provides user-controllable I/O operations
that allow users to control data access, file structure, and data
distribution. The domain-specific file model and user controllability give
low I/O overhead and allow programmers to exploit the aggregate bandwidth of
parallel disks.
Keywords: parallel I/O, pario-bib
Comment: They describe an interface that seems to
allow easier access for programmers that want to map matrices onto parallel
files. The concepts are not well explained, so it's hard to really understand
what is new and different. They make no explicit comparison with other
advanced interfaces like that in Vesta or Galley. No performance results.
Abstract: In many different areas of computing, problems can arise which are
too large to fit in main memory. For these problems, the I/O cost of moving
data between main memory and secondary storage (for example, disks) becomes a
significant bottleneck affecting the performance of the program. Since most
algorithms do not take into account the size of main memory, new algorithms
have been developed to optimize the number of I/Os performed. This paper
details the implementation of one such algorithm, for external-memory
depth-first search. Depth-first search is a basic tool for solving many
problems in graph theory, and since graph theory is applicable to many large
computational problems, it is important to make sure that such a basic tool
is designed to avoid the bottleneck of main-memory-to-secondary-storage I/Os.
The algorithm whose implementation is described in this paper is sketched out
in an extended abstract by Chiang et al. We attempt to improve the given
algorithm by minimizing the I/Os performed, and to extend it by finding
disjoint trees and by classifying all the edges in the problem.
Keywords: out-of-core algorithm, parallel I/O,
pario-bib
Comment: Senior honors thesis. Advisor: Tom
Cormen.
Abstract: The paper shows the implementation of a
3D simulation code for turbulent flow and combustion processes in full-scale
utility boilers on an Intel Paragon XP/S computer. For the portable
parallelization, an explicit approach is chosen using a domain decomposition
method for the static subdivision of the numerical grid together with the
SPMD programming model. The measured speedup for the presented case using a
coarse grid is good, although some numerical requirements restrict the
implemented message passing to strongly synchronized communication. On the
Paragon, the NX message passing library is used for the computations.
Furthermore, MPI and PVM are applied and their pros and cons on this computer
are described. In addition to the basic message passing techniques for local
and global communication, other possibilities are investigated. Besides the
applicability of the vectorizing capability of the compiler, the influence of
the I/O performance during computations is demonstrated. The scalability of
the parallel application is presented for a refined discretization.
Keywords: parallel I/O, application, pario-bib
Keywords: parallel file system, distributed shared
memory, DSM, COMA, pario-bib
Comment: Basically, cooperative shared memory with
a backing store.
Keywords: parallel I/O algorithm, pario-bib
Abstract: This paper presents a framework of using
resource metrics to characterize the various models of parallel
computation. Our framework reflects the approach of recent models to abstract
architectural details into several generic parameters, which we call resource
metrics. We examine the different resource metrics chosen by different
parallel models, categorizing the models into four classes: the basic
synchronous models, and extensions of the basic models which more accurately
reflect practical machines by incorporating notions of asynchrony,
communication cost and memory hierarchy. We then present a new parallel
computation model, the LogP-HMM model, as an illustration of design
principles based on the framework of resource metrics. The LogP-HMM model
extends an existing parameterized network model (LogP) with a sequential
hierarchical memory model (HMM) characterizing each processor. The result
accurately captures both network communication costs and the effects of
multileveled memory such as local cache and I/O. We examine the potential
utility of our model in the design of near optimal sorting and FFT
algorithms.
Keywords: parallel I/O algorithm, pario-bib
Abstract: In this paper, we present a framework for synthesizing
I/O-efficient out-of-core programs for block recursive algorithms, such as
the fast Fourier transform (FFT) and block matrix transposition algorithms.
Our framework uses an algebraic representation which is based on tensor
products and other matrix operations. The programs are optimized for Vitter
and Shriver's striped two-level memory model, in which data can be
distributed using various cyclic(B) distributions, in contrast to the
normally used physical track distribution cyclic(B_d), where B_d is the
physical disk block size. We first introduce tensor bases to capture the
semantics of block-cyclic data distributions of out-of-core data and also
data access patterns to out-of-core data. We then present program generation
techniques for tensor products and matrix transposition. We accurately
represent the number of parallel I/O operations required for the synthesized
programs for tensor products and matrix transposition as a function of tensor
bases and data distributions. We introduce an algorithm to determine the data
distribution which optimizes the performance of the synthesized programs.
Further, we formalize the procedure of synthesizing efficient out-of-core
programs for tensor product formulas with various block-cyclic distributions
as a dynamic programming problem. We demonstrate the effectiveness of our
approach through several examples. We show that the choice of an appropriate
data distribution can reduce the number of passes over out-of-core data by as
much as a factor of eight for a tensor product, and that the dynamic
programming approach can greatly reduce the number of passes over out-of-core
data for the overall tensor product formulas.
Keywords: parallel I/O, out-of-core algorithm,
pario-bib
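The cyclic(B) distributions named in this abstract are ordinary block-cyclic maps; a sketch of the index arithmetic (illustrative only; B is the block size and D the number of disks):

  # Hedged sketch of a cyclic(B) (block-cyclic) distribution over D disks:
  # group consecutive records into blocks of B and deal the blocks out to
  # disks round-robin.
  def cyclic_b(i, B, D):
      block = i // B
      disk = block % D
      local_block = block // D                 # this disk's block count before it
      local_offset = local_block * B + (i % B)
      return disk, local_offset

  # Example: cyclic(4) over 3 disks.
  for i in range(12):
      print(i, cyclic_b(i, 4, 3))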
Abstract: This paper presents a framework for
synthesizing I/O-efficient out-of-core programs for block recursive
algorithms, such as the fast Fourier transform and matrix transpositions. The
programs are synthesized from tensor (Kronecker) product representations of
algorithms. These programs are optimized for a striped two-level memory model
in which the out-of-core data can have block-cyclic distributions on multiple
disks.
Keywords: parallel I/O algorithm, pario-bib
Abstract: In this paper, we present a framework for synthesizing
I/O-efficient out-of-core programs for block recursive algorithms, such as
the fast Fourier transform (FFT) and block matrix transposition algorithms.
Our framework uses an algebraic representation which is based on tensor
products and other matrix operations. The programs are optimized for Vitter
and Shriver's striped two-level memory model, in which data can be
distributed using various cyclic(B) distributions, in contrast to the
normally used physical track distribution cyclic(B_d), where B_d is the
physical disk block size. We first introduce tensor bases to capture the
semantics of block-cyclic data distributions of out-of-core data and also
data access patterns to out-of-core data. We then present program generation
techniques for tensor products and matrix transposition. We accurately
represent the number of parallel I/O operations required for the synthesized
programs for tensor products and matrix transposition as a function of tensor
bases and data distributions. We introduce an algorithm to determine the data
distribution which optimizes the performance of the synthesized programs.
Further, we formalize the procedure of synthesizing efficient out-of-core
programs for tensor product formulas with various block-cyclic distributions
as a dynamic programming problem. We demonstrate the effectiveness of our
approach through several examples. We show that the choice of an appropriate
data distribution can reduce the number of passes over out-of-core data by as
much as a factor of eight for a tensor product, and that the dynamic
programming approach can greatly reduce the number of passes over out-of-core
data for the overall tensor product formulas.
Keywords: parallel I/O algorithm, pario-bib
Abstract: For concurrent I/O operations, atomicity
defines the results in the overlapping file regions simultaneously
read/written by requesting processes. Atomicity has been well studied at the
file system level, as in the POSIX standard. In this paper, we investigate the
problems arising from the implementation of MPI atomicity for concurrent
overlapping write access and provide two programming solutions. Since the MPI
definition of atomicity differs from the POSIX one, an implementation that
simply relies on the POSIX file systems does not guarantee correct MPI
semantics. To have a correct implementation of atomic I/O in MPI, we examine
the efficiency of three approaches: 1) file locking, 2) graph-coloring, and
3) process-rank ordering. The performance and complexity of these methods are
analyzed, and experimental results are presented for file systems
including NFS, SGI's XFS, and IBM's GPFS.
Keywords: MPI, concurrent I/O operations,
overlapping write access, atomic I/O operations, pario-bib
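Of the three approaches, process-rank ordering is the simplest to picture: ranks touching overlapping regions serialize themselves in rank order. A sketch of that idea using mpi4py (not the paper's implementation; the file name is invented):

  # Hedged sketch of process-rank ordering for an overlapping write: each rank
  # waits for a token from the previous rank, performs its write, then passes
  # the token on, so the overlap ends with a well-defined (highest-rank) value.
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank, size = comm.Get_rank(), comm.Get_size()

  def ordered_overlapping_write(path, offset, data):
      if rank == 0:
          open(path, "ab").close()               # make sure the file exists
      else:
          comm.recv(source=rank - 1, tag=0)      # wait for my turn
      with open(path, "r+b") as f:
          f.seek(offset)
          f.write(data)
      if rank < size - 1:
          comm.send(None, dest=rank + 1, tag=0)  # hand the turn to the next rank

  # Example: every rank writes the same 4 bytes; the last rank's value remains.
  ordered_overlapping_write("shared.dat", 0, rank.to_bytes(4, "little"))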
Abstract: Dedicated cluster parallel computers
(DCPCs) are emerging as low-cost high performance environments for many
important applications in science and engineering. A significant class of
applications that perform well on a DCPC are coarse-grain applications that
involve large amounts of file I/O. Current research in parallel file systems
for distributed systems is providing a mechanism for adapting these
applications to the DCPC environment. We present the Parallel Virtual File
System (PVFS), a system that provides disk striping across multiple nodes in
a distributed parallel computer and file partitioning among tasks in a
parallel program. PVFS is unique among similar systems in that it uses a
stream-based approach that represents each file access with a single set of
request parameters and decouples the number of network messages from details
of the file striping and partitioning. PVFS also provides support for
efficient collective file accesses and allows overlapping file partitions. We
present results of early performance experiments that show PVFS achieves
excellent speedups in accessing moderately sized file segments.
Keywords: parallel I/O, cluster computing,
parallel file system, pario-bib
Keywords: parallel I/O, workstation cluster, text
retrieval, pario-bib
Comment: They implement a parallel text retrieval
application on a cluster of DEC 5000 workstations.
Abstract: In this paper, we study I/O server
placement for optimizing parallel I/O performance on switch-based clusters,
which typically adopt irregular network topologies to allow construction of
scalable systems with incremental expansion capability. Finding an optimal
solution to this problem is computationally intractable. We quantified the
number of messages travelling through each network link by a workload
function, and developed three heuristic algorithms to find good solutions
based on the values of the workload function. Our simulation results
demonstrate the performance advantage of our algorithms over a number of
algorithms commonly used in existing parallel systems. In particular, the
load-balance-based algorithm is superior to the other algorithms in most
cases, with improvement ratios of 10% to 95% in parallel I/O
throughput.
Keywords: I/O server placement, network
topologies, switch-based clusters, pario-bib
Keywords: disk array, log-structured file system,
RAID, parallel I/O, pario-bib
Comment: Part of jin:io-book.
Keywords: parallel I/O, pario-bib
Keywords: parallel I/O, disk striping, disk array,
pario-bib
Keywords: parallel I/O, replication, file system,
disk mirroring, disk shadowing, pario-bib
Comment: A look at shadowed disks. If you have $k$
disks set up to read from the disk with the shortest seek, but write to all
disks, you have increased reliability, read time like the min of the seeks,
and write time like the max of the seeks. It appears that with increasing $k$
you can get good performance. But this paper clearly shows, since writes move
all disk heads to the same location, that the effective value of $k$ is
actually quite low. Only 4-10 disks are likely to be useful for most traffic
loads.
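The min/max intuition in this comment is easy to reproduce with a tiny Monte Carlo run, under the (optimistic) assumption that the k seeks are independent and uniform on [0, 1]; the paper's point is precisely that writes destroy this independence:

  # Hedged sketch: expected read seek (min of k seeks) and write seek (max of
  # k seeks) for k-way shadowed disks with independent uniform seek times.
  # Analytically these are 1/(k+1) and k/(k+1).
  import random

  def shadow_seek_times(k, trials=100_000):
      read = write = 0.0
      for _ in range(trials):
          seeks = [random.random() for _ in range(k)]
          read += min(seeks)
          write += max(seeks)
      return read / trials, write / trials

  for k in (1, 2, 4, 8, 16):
      r, w = shadow_seek_times(k)
      print(f"k={k:2d}  avg read seek={r:.3f}  avg write seek={w:.3f}")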
Abstract: It is now recognized that a high level
of I/O performance is crucial in making effective use of parallel machines
for many scientific application codes. This paper considers the I/O
requirements in one particular scientific application area; 3D modelling of
continental shelf sea regions. We identify some of the scientific aims which
drive the model development, and the consequent impact on the I/O needs. As a
case study we take a parallel production code running a simulation of the
North Sea on a Cray T3D platform and investigate the I/O performance in
dealing with the dominant I/O component; dumping of results data to disk. In
order to place the performance issues in a more general framework we
construct a simple theoretical model of I/O requirements, and use this to
probe the impact of available I/O performance on current and proposed
scientific objectives.
Keywords: parallel I/O application, pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Keywords: RAID, disk array, parallel I/O,
distributed file system, pario-bib
Comment: One of the features of this system is the
way they develop and execute transaction plans as little scripts that are
built by the client, sent to the servers, and then executed by interpreters.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: They took the Unix file system from SunOS
and extended it to run on the CM-5. This involved handling non-power-of-two
block sizes, parallel I/O calls, large file sizes, and more encouragement for
extents to be allocated. The hardware is particularly suited to RAID 3 with a
16 byte striping unit, although in theory the software could do anything it
wants. Geared to data-parallel model. Proc nodes (PNs) contact the
timesharing daemon (TSD) on the control processor (CP), who gets block lists
from the file system, which runs on one of the CPs. The TSD then arranges
with the disk storage nodes (DSNs) to do the transfer directly with the PNs.
Each DSN has 8 MB of buffer space, 8 disk drives, 4 SCSI busses, and a SPARC
as controller. Partition managers mount non-local sfs via NFS. Performance
results good. Up to 185 MB/s on 118 (2 MB/s) disks.
Abstract: High-end storage systems, such as those
in large data centers, must service multiple independent workloads. Workloads
often require predictable quality of service, despite the fact that they have
to compete with other rapidly-changing workloads for access to common storage
resources. We present a novel approach to providing performance guarantees
in this highly volatile scenario, in an efficient and cost-effective way.
Façade, a virtual store controller, sits between hosts and storage
devices in the network, and throttles individual I/O requests from multiple
clients so that devices do not saturate. We implemented a prototype, and
evaluated it using real workloads on an enterprise storage system. We also
instantiated it to the particular case of emulating commercial disk arrays.
Our results show that Façade satisfies performance objectives while
making efficient use of the storage resources, even in the presence of
failures and bursty workloads with stringent performance requirements.
Keywords: file systems, qos, quality of service,
pario-bib
Keywords: parallel I/O, pario-bib
Comment: This paper is about a NASA project
GEOS-DAS (Goddard Earth Observing System-Data Assimilation System). The goal
of the project is to produce ''accurate gridded datasets of atmospheric
fields''. The data will be used by meteorologists for weather analysis and
forecasts as well as being a tool for climate research. This paper discusses
their plans to parallelize the core code of the system. They include a
section on parallel I/O.
Abstract: Efficient collective output of intermediate results to secondary
storage becomes more and more important for scientific simulations as the gap
between processing power/interconnection bandwidth and the I/O system
bandwidth enlarges. Dedicated servers can offload I/O from compute processors
and shorten the execution time, but it is not always possible or easy for an
application to use them. We propose the use of active buffering with threads
(ABT) for overlapping I/O with computation efficiently and flexibly without
dedicated I/O servers. We show that the implementation of ABT in ROMIO, a
popular implementation of MPI-IO, greatly reduces the application-visible
cost of ROMIO's collective write calls, and improves an application's overall
performance by hiding I/O cost and saving implicit synchronization overhead
from collective write operations. Further, ABT is high-level,
platform-independent, and transparent to users, giving users the benefit of
overlapping I/O with other processing tasks even when the file system or
parallel I/O library does not support asynchronous I/O.
Keywords: parallel I/O, pario-bib
Abstract: In this paper, we discuss our experience
of providing high performance parallel I/O for a large-scale, on-going,
multi-disciplinary simulation project for solid propellant rockets. We
describe the performance and data management issues observed in this project
and present our solutions, including (1) support for relatively fine-grained
distribution of irregular datasets in parallel I/O, (2) a flexible data
management facility for inter-module communication, and (3) two schemes to
overlap computation with I/O. Performance results obtained from the rocket
simulation's development and production platforms show that our I/O
optimizations can dramatically reduce the simulation's visible I/O cost, as
well as the number of disk files, and significantly improve the overall
performance. Meanwhile, our data management facility helps to provide
simulation developers with simple user interfaces for parallel I/O.
Keywords: parallel I/O, pario-bib
Abstract: Input/Output is a big obstacle to effective use of teraflops-scale
computing systems. Motivated by earlier parallel I/O measurements on an Intel
TFLOPS machine, we conduct studies to determine the sensitivity of parallel
I/O performance on multi-programmed mesh-connected machines with respect to
number of I/O nodes, number of compute nodes, network link bandwidth, I/O
node bandwidth, spatial layout of jobs, and read or write demands of
applications. Our extensive simulations and analytical modeling yield
important insights into the limitations on parallel I/O performance due to
network contention, and into the possible gains in parallel I/O performance
that can be achieved by tuning the spatial layout of jobs. Applying these
results, we devise a new processor allocation strategy that is sensitive to
parallel I/O traffic and the resulting network contention. In performance
evaluations driven by synthetic workloads and by a real workload trace
captured at the San Diego Supercomputing Center, the new strategy improves
the average response time of parallel I/O intensive jobs by up to a factor of
4.5.
Keywords: parallel I/O, pario-bib
Abstract: Parallel servers realize scalability and
availability by effectively using multiple hardware resources (i.e., nodes
and disks). Scalability is improved by distributing processes and data onto
multiple resources; and availability is maintained by substituting a failed
resource with a spare one. Dynamic Gateways extends these features to
networking, by balancing the traffic among multiple connections to the
network in order to improve scalability, and detours traffic around failed
resources to maintain availability. This is made transparent to the clients
and to applications in the server by using proxy and gratuitous ARP to
control the network traffic. A performance evaluation shows that Dynamic
Gateways improves the scalability (allowing the maximum networking
performance to increase with increasing number of connections) and the
performance (improving throughput and reducing access latency).
Keywords: parallel networking, network I/O,
parallel I/O, pario-bib
Comment: Contact fred-m@crl.hitachi.co.jp,
sagawa@crl.hitachi.co.jp, or tetanaka@kanagawa.hitachi.co.jp.
Abstract: A parallel finite element groundwater
transport code is used to compare three different strategies for performing
parallel I/O: (1) have a single processor collect data and perform sequential
I/O in large blocks, (2) use variations of vendor-specific I/O extensions,
and (3) use the EDONIO I/O library. Each processor performs many writes of
one to four kilobytes to reorganize local data in a global shared file. Our
findings suggest that having a single processor collect data and perform large
block-contiguous operations may be quite efficient and portable for up to 32
processor configurations. This approach does not scale well for a larger
number of processors since the single processor becomes a bottleneck for
gathering data. The effective application I/O rate observed, which includes
times for opening and closing files, is only a fraction of the peak device
read/write rates. Some form of data redistribution and buffering in remote
memory as performed in EDONIO may yield significant improvements for
non-contiguous data I/O access patterns and short requests. Implementors of
parallel I/O systems may consider some form of buffering as performed in
EDONIO to speed up such I/O requirements.
Keywords: parallel I/O application, pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Abstract: Traditionally, maximizing input/output
performance has required tailoring application input/output patterns to the
idiosyncrasies of specific input/output systems. The authors show that one
can achieve high application input/output performance via a low overhead
input/output system that automatically recognizes file access patterns and
adaptively modifies system policies to match application requirements. This
approach reduces the application developer's input/output optimization effort
by isolating input/output optimization decisions within a retargetable file
system infrastructure. To validate these claims, they have built a
lightweight file system policy testbed that uses a trained learning mechanism
to recognize access patterns. The file system then uses these access pattern
classifications to select appropriate caching strategies, dynamically
adapting file system policies to changing input/output demands throughout
application execution. The experimental data show dramatic speedups on both
benchmarks and input/output intensive scientific applications.
Keywords: parallel I/O, pario-bib
Comment: See also madhyastha:thesis, and related
papers.
Abstract: Input/output performance on current
parallel file systems is sensitive to a good match of application access
pattern to file system capabilities. Automatic input/output access
classification can determine application access patterns at execution time,
guiding adaptive file system policies. In this paper we examine a new method
for access pattern classification that uses hidden Markov models, trained on
access patterns from previous executions, to create a probabilistic model of
input/output accesses. We compare this approach to a neural network
classification framework, presenting performance results from parallel and
sequential benchmarks and applications.
Keywords: workload characterization, file access
pattern, parallel I/O, pario-bib
Comment: The most interesting thing in this paper
is the use of a Hidden Markov Model to understand the access pattern of an
application to a file. After running the application on the file once, and
simultaneously training their HMM, they use the result to tune the system for
the next execution (cache size, cache partitioning, prefetching, Intel file
mode, etc). They get much better performance in future runs. See also
madhyastha:thesis, and related papers.
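As a much-simplified stand-in for the HMM (a plain first-order Markov chain over a symbolized access stream, not the authors' model), the training step looks roughly like this; the symbols and the trace are invented:

  # Hedged sketch: learn first-order transition counts over a symbolized
  # access stream from one run, so a later run can be scored or predicted.
  from collections import defaultdict

  def symbolize(offsets):
      out = []
      for prev, cur in zip(offsets, offsets[1:]):
          out.append("seq" if cur == prev + 1 else "fwd" if cur > prev else "bwd")
      return out

  def train(symbols):
      counts = defaultdict(lambda: defaultdict(int))
      for a, b in zip(symbols, symbols[1:]):
          counts[a][b] += 1
      return counts

  trace = symbolize([0, 1, 2, 3, 10, 11, 12, 5, 6, 7])
  model = train(trace)
  print("P(seq -> seq) =", model["seq"]["seq"] / sum(model["seq"].values()))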
Abstract: Parallel input/output systems attempt to
alleviate the performance bottleneck that affects many input/output intensive
applications. In such systems, an understanding of the application access
pattern, especially how requests from multiple processors for different file
regions are logically related, is important for optimizing file system
performance. We propose a method for automatically classifying these global
access patterns and using these global classifications to select and tune
file system policies to improve input/output performance. We demonstrate this
approach on benchmarks and scientific applications using global
classification to automatically select appropriate underlying Intel PFS
input/output modes and server buffering strategies.
Keywords: file access pattern, parallel I/O,
pario-bib
Comment: No page numbers: web and CDROM
proceedings only. See also madhyastha:thesis and related papers.
Keywords: informed prefetching, disk-directed I/O,
parallel I/O, pario-bib
Comment: They argue that if enough application
prefetches are made, a standard Unix interface will provide the same
performance as a collective I/O interface. She uses simulation to show that
if the file ordering is preserved, then the prefetch depth (the number of
advance requests) is bounded by the number of disk drives. They look at two
global access patterns: a simple interleaved sequential pattern and a 3-D
block decomposition. Their experiment used 8 procs and 8 disks and did a
comparison of the prefetching techniques to disk-directed I/O. Empirical
studies showed that they needed a prefetch horizon of one to two times the
number of disks to match the performance of disk-directed I/O, but the
prefetching techniques require more memory.
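The bounded prefetch horizon they describe can be mimicked at the application level with a small pool of outstanding reads; a sketch (the file name, block size, and use of threads are assumptions, not the paper's mechanism):

  # Hedged sketch: keep at most `depth` reads outstanding ahead of the block
  # the application is consuming (depth ~ 1-2x the number of disks, per the
  # comment above).
  from concurrent.futures import ThreadPoolExecutor

  BLOCK = 64 * 1024

  def read_block(path, i):
      with open(path, "rb") as f:
          f.seek(i * BLOCK)
          return f.read(BLOCK)

  def prefetching_reader(path, num_blocks, depth):
      with ThreadPoolExecutor(max_workers=depth) as pool:
          window = [pool.submit(read_block, path, i)
                    for i in range(min(depth, num_blocks))]
          for i in range(num_blocks):
              data = window.pop(0).result()       # block i, likely already read
              if i + depth < num_blocks:
                  window.append(pool.submit(read_block, path, i + depth))
              yield data

  # Usage: for block in prefetching_reader("bigfile.dat", 1000, depth=8): ...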
Abstract: Input/output performance on current
parallel file systems is sensitive to a good match of application access
patterns to file system capabilities. Automatic input/output access pattern
classification can determine application access patterns at execution time,
guiding adaptive file system policies. In this paper, we examine and compare
two novel input/output access pattern classification methods based on
learning algorithms. The first approach uses a feedforward neural network
previously trained on access pattern benchmarks to generate qualitative
classifications. The second approach uses hidden Markov models trained on
access patterns from previous executions to create a probabilistic model of
input/output accesses. In a parallel application, access patterns can be
recognized at the level of each local thread or as the global interleaving of
all application threads. Classification of patterns at both levels is
important for parallel file system performance; we propose a method for
forming global classifications from local classifications. We present results
from parallel and sequential benchmarks and applications that demonstrate the
viability of this approach.
Keywords: parallel I/O, file access pattern,
pario-bib
Keywords: multiprocessor file system, prefetching,
caching, parallel I/O, multiprocessor file system interface, pario-bib
Comment: See also madhyastha:thesis, and related
papers.
Keywords: parallel I/O, file access pattern,
pario-bib
Comment: See also madhyastha:classification,
madhyastha:global, madhyastha:adaptive, madhyastha:optimizing.
Abstract: The performance of high-speed
network-attached storage applications is often limited by end-system
overhead, caused primarily by memory copying and network protocol processing.
In this paper, we examine alternative strategies for reducing overhead in
such systems. We consider optimizations to remote procedure call (RPC)-based
data transfer using either remote direct memory access (RDMA) or network
interface support for pre-posting of application receive buffers. We
demonstrate that both mechanisms enable file access throughput that saturates
a 2Gb/s network link when performing large I/Os on relatively slow, commodity
PCs. However, for multi-client workloads dominated by small I/Os, throughput
is limited by the per-I/O overhead of processing RPCs in the server. For such
workloads, we propose the use of a new network I/O mechanism, Optimistic RDMA
(ORDMA). ORDMA is an alternative to RPC that aims to improve server
throughput and response time for small I/Os. We measured performance
improvements of up to 32% in server throughput and 36% in response time
with use of ORDMA in our prototype.
Keywords: file systems, rpc optimizations, rdma,
multi-client workload, small I/O, pario-bib
Abstract: Most studies of processor scheduling in
multiprogrammed parallel systems have ignored the I/O performed by
applications. Recent studies have demonstrated that significant I/O
operations are performed by a number of different classes of parallel
applications. This paper focuses on some basic issues that underlie
scheduling in multiprogrammed parallel environments running applications with
I/O. Characterization of the I/O behavior of parallel applications is
discussed first. Based on simulation models this research investigates the
influence of these I/O characteristics on processor scheduling.
Keywords: workload characterization, scheduling,
parallel I/O, pario-bib
Keywords: workload characterization, parallel I/O,
pario-bib
Comment: Analytical workload model. Simulation
studies. See also kwong:distribution.
Abstract: The paper studies different schemes to
enhance the reliability, availability and security of a high performance
distributed storage system. We have previously designed a distributed
parallel storage system that employs the aggregate bandwidth of multiple data
servers connected by a high speed wide area network to achieve scalability
and high data throughput. The general approach of the paper employs erasure
error correcting codes to add data redundancy that can be used to retrieve
missing information caused by hardware, software, or human faults. The paper
suggests techniques for reducing the communication and computation overhead
incurred while retrieving missing data blocks from redundant information.
These techniques include clustering, multidimensional coding, and the full
two dimensional parity scheme.
Keywords: parallel I/O, pario-bib
Keywords: parallel I/O, disk array, I/O
bottleneck, pario-bib
Comment: See also Electronics, Nov. 88 p 24, Dec.
88 p 112. Trade journal short on disk arrays. Very good intro. No new
technical content. Concentrates on RAID project. Lists several commercial
versions. Mostly concentrates on single-controller versions.
Abstract: Striping techniques combined with an adequate replication policy
across the Grid offer the possibility of significantly improving data access
and processing times, while eliminating the need for local data mirroring,
thus saving significantly on storage costs. First results on a local cluster
following a simple strategy are presented.
Keywords: RAID, RAID-1, data striping, GRID,
pario-bib
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: Man pages for MasPar file system
interface. They have either a single shared file pointer, after which all
processors read or write in an interleaved pattern, or individual (plural)
file pointer, allowing arbitrary access patterns. Updated in 1992 with many
more features.
Keywords: parallel I/O, pario-bib
Comment: Information about the early Maximum
Strategy disk array, which striped over 4 disk drives, apparently
synchronously.
Keywords: parallel I/O, reliability, disk
shadowing, disk mirroring, pario-bib
Comment: Variation on mirrored disks using more
than 2 disks, to spread the files around. Good performance increases.
Abstract: Shared file systems which use a
physically shared mass storage device have existed for many years, although
not on UNIX based operating systems. This paper describes a shared file
system (SFS) that was implemented first as a special project on the Cray
Research Inc. (CRI) UNICOS operating system. A more general product was then
built on top of this project using a HIPPI disk array for the shared mass
storage. The design of SFS is outlined, as well as some performance
experiences with the product. We describe how SFS interacts with the OSF
distributed file service (DFS) and with the CRI data migration facility
(DMF). We also describe possible development directions for the SFS product.
Keywords: mass storage, distributed file system,
parallel I/O, pario-bib
Comment: They use hardware to tie the same storage
device (a disk array) to several computers (Cray C90s). They build a custom
piece of hardware just to service semaphore requests very fast. HIPPI is the
interconnect. Details have a lot to do with the synchronization between
processors trying to update the same metadata; that's why they use the
semaphores.
Abstract: We propose a framework for I/O in
parallel and distributed systems. The framework is highly customizable and
extendible, and enables programmers to offer high level objects in their
applications, without requiring them to struggle with the low level and
sometimes complex details of high performance distributed I/O. Also, the
framework exploits application specific information to improve I/O
performance by allowing specialized programmers to customize the framework.
Internally, we use indirection and granularity control to support migration,
dynamic load balancing, fault tolerance, etc. for objects of the I/O system,
including those representing application data.
Keywords: input-output programs, object-oriented,
parallel systems; I/O performance, migration, dynamic load balancing, fault
tolerance, parallel I/O, pario-bib
Abstract: Modular clusters are now composed of
non-uniform nodes with different CPUs, disks or network cards so that
customers can adapt the cluster configuration to the changing technologies
and to their changing needs. This challenges dataflow parallelism as the
primary load balancing technique of existing parallel database systems. We
show in this paper that dataflow parallelism alone is ill suited for modular
clusters because running the same operation on different subsets of the data
can not fully utilize non-uniform hardware resources. We propose and evaluate
new load balancing techniques that blend pipeline parallelism with data
parallelism. We consider relational operators as pipelines of fine-grained
operations that can be located on different cluster nodes and executed in
parallel on different data subsets to best exploit non-uniform resources. We
present an experimental study that confirms the feasibility and effectiveness
of the new techniques in a parallel execution engine prototype based on the
open-source DBMS Predator.
Keywords: parallel query processing, load
balancing, parallel I/O, pario-bib
Keywords: parallel I/O, disk array, pario-bib,
RAID
Comment: Allocate small- and medium-sized files
entirely on one disk rather than striped, to cut seek and rotation latency
that would happen if they were spread across many disks.
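The allocation rule amounts to a size threshold at file-creation time; a toy sketch (the threshold, stripe unit, and round-robin disk choice are invented defaults):

  # Hedged sketch: keep small and medium files whole on one disk (round-robin
  # across disks) and stripe only files above a size threshold, so tiny
  # transfers do not pay a seek and a rotation on every disk.
  STRIPE_THRESHOLD = 256 * 1024   # invented: 256 KB
  STRIPE_UNIT = 64 * 1024

  next_disk = 0

  def place_file(size, num_disks):
      global next_disk
      if size < STRIPE_THRESHOLD:
          disk = next_disk
          next_disk = (next_disk + 1) % num_disks
          return [("whole", disk)]
      units = (size + STRIPE_UNIT - 1) // STRIPE_UNIT
      return [("unit", u, u % num_disks) for u in range(units)]

  print(place_file(10_000, 8))      # stays on a single disk
  print(place_file(1_000_000, 8))   # striped across all disks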
Keywords: parallel I/O, disk array, disk striping,
pario-bib
Comment: Describes Strategy 2 Disk Array
Controller, which allows 4 or 8 drives, hardware striped, with parity drive
and 0-4 hot spares. Up to 4 channels to cpu(s). Logical block interface.
Defects, errors, formatting, drive failures all handled automatically. Peak
40 MB/s data transfer on each channel.
Keywords: multiprocessor architecture, parallel
I/O, pario-bib
Comment: Three node types: 4 SPARC (50 MHz), 1
SPARC + two Fujitsu vector procs, or 1 SPARC + 3 I/O ports. All have a
special communications processor that supports remote memory access. Each has
128 MBytes in 16 banks. Memory-memory transfer operations using ``remote
DMA'', supported by the communications processor. User-level comm interface,
with protection. Uses multistage network with 8x8 crossbar switches, looks
like a fat tree. S/BUS, separate from the memory bus, is used for I/O, either
directly, or through 2 SCSI and 1 ethernet. Control and diagnostic networks.
Parallel file system stripes across multiple partitions. Can use RAID.
Communications processor has its own MMU; control registers are mapped to
user space. Network-wide virtual addresses can support shared memory? Remote
store, atomic operations, global operations. Comm proc can support I/O
threads - but can it talk to the disks? OS based on Solaris 2, plus global
shared memory, parallel file system, and capability-based protection. Machine
is logically partitioned into login, devices, and parallel computation.
Abstract: This paper introduces a new concept
called Multi-Collective I/O (MCIO) that extends conventional collective I/O
to optimize I/O accesses to multiple arrays simultaneously. In this approach,
as in collective I/O, multiple processors co-ordinate to perform I/O on
behalf of each other if doing so improves overall I/O time. However, unlike
collective I/O, MCIO considers multiple arrays simultaneously; that is, it
has a more global view of the overall I/O behavior exhibited by the
application. This paper shows that determining the optimal MCIO access
pattern is an NP-complete problem, and proposes two different heuristics for
the access pattern detection problem (also called the assignment problem).
Both of the heuristics have been implemented within a runtime library, and
tested using a large-scale scientific application. Our preliminary results
show that MCIO outperforms collective I/O by as much as 87%. Our runtime
library-based implementation can be used by users as well as by optimizing
compilers. Based on our results, we recommend that future library designers
for I/O-intensive applications include MCIO in their suite of optimizations.
Keywords: file systems, pario-bib
Keywords: network attached peripherals, analytic
model, mass storage, parallel I/O, pario-bib
Keywords: RAID, disk array, parallel I/O,
pario-bib
Comment: Part of jin:io-book; reformatted version
of menon:compare.
Keywords: parallel I/O, RAID, disk array,
pario-bib
Comment: Part of jin:io-book; reformatted version
of menon:sparing.
Keywords: RAID, disk array, parallel I/O,
pario-bib
Comment: He compares a RAID-5 disk array with a
log-structured array (LSA). An LSA is essentially an implementation of a
log-structured file system inside a disk controller. The disk controller
buffers up writes in a non-volatile cache; when the outgoing data buffer is
full, it is written to some large contiguous region of the disk. The
controller manages a directory to keep track of the various segment
locations, and does garbage collection (cleaning). They can insert a
compression algorithm in front of the cache so that they get better cache and
disk utilization by storing data in compressed form. for fair comparison they
compare with a similar feature in the plain RAID5 array.
Keywords: hierarchical storage, tape storage,
tertiary storage, tape robot, parallel I/O, pario-bib
Comment: Part of a special issue on parallel and
distributed I/O.
Abstract: This paper explores how choice of
sparing methods impacts the performance of RAID level 5 (or parity striped)
disk arrays. The three sparing methods examined are dedicated sparing,
distributed sparing, and parity sparing. For database type workloads with
random single block reads and writes, array performance is compared in four
different modes - normal mode (no disks have failed), degraded mode (a disk
has failed and its data has not been reconstructed), rebuild mode (a disk has
failed and its data is being reconstructed), and copyback mode (which is
needed for distributed sparing and parity sparing when failed disks are
replaced with new disks). Attention is concentrated on small disk subsystems
(fewer than 32 disks) where choice of sparing method has significant impact
on array performance, rather than large disk subsystems (64 or more disks).
It is concluded that, for disk subsystems with a small number of disks,
distributed sparing offers major advantages over dedicated sparing in normal,
degraded and rebuild modes of operation, even if one has to pay a copyback
penalty. Furthermore, it is better than parity sparing in rebuild mode and
similar to it in other operating modes, making it the sparing method of
choice.
Keywords: parallel I/O, RAID, disk array,
pario-bib
Abstract: The philosophy behind grid is to use
idle resources to achieve a higher level of computational services
(computation, storage, etc.). Existing data grid solutions are based on new
servers, specific APIs, and protocols; however, this approach is not
realistic for enterprises and universities, because it requires the
deployment of new data servers across the organization. This paper describes
a new approach to data access in computational grids. This approach is called
GridExpand, a parallel I/O middleware that integrates heterogeneous data
storage resources in grids. The proposed grid solution integrates available
data network solutions (NFS, CIFS, WebDAV) and makes it possible to access a
global grid file system. Our solution differs from others because it does not
require the installation of new data servers with new protocols. Most data
grid solutions use replication as the way to obtain high performance.
Replication, however, introduces consistency problems for many collaborative
applications, and sometimes requires a large amount of resources. To
obtain high performance, we apply the parallel I/O techniques used in
parallel file systems.
Keywords: data grids, parallel I/O, data
declustering, pario-bib
Keywords: disk striping, disk array, RAID,
parallel I/O, pario-bib
Keywords: parallel I/O, file system workload,
pario-bib
Comment: This application runs on the NASA Ames
iPSC/860. This application has some I/O: reading in the input file, which is
a set of x,y,z data points. I/O was really slow if formatted (ie, ASCII
instead of binary) or sequential instead of parallel. Any input record could
go to any processor; the first step in the algorithm (after the points are
read in) is essentially a kind of sort to move points around to localize
points and balance load.
Abstract: This article presents methods and tools
for building parallel applications based on commodity components: PCs, SCSI
disks, Fast Ethernet, Windows NT. Chief among these tools is CAP, our
computer-aided parallelization tool. CAP generates highly pipelined
applications that run communication and I/O operations in parallel with
processing operations. One of CAP's successes is the Visible Human Slice
Server, a 3D tomographic image server that allows clients to choose and view
any cross section of the human body.
Keywords: applications, image processing,
pario-app, parallel I/O, pario-bib
Keywords: parallel computing, image processing,
parallel I/O application, parallel I/O, pario-bib
Comment: The complete description of PS$^2$ and
its use with CAP, a parallelization tool, for data-flow-like support of
parallel I/O. Nice work. See also messerli:jimage, gennart:CAP,
vetsch:visiblehuman, messerli:tomographic.
Abstract: We propose a new approach for developing
parallel I/O- and compute-intensive applications. At a high level of
abstraction, a macro data flow description describes how processing and disk
access operations are combined. This high-level description (CAP) is
precompiled into compilable and executable C++ source language. Parallel file
system components specified by CAP are offered as reusable CAP operations.
Low-level parallel file system components can, thanks to the CAP formalism,
be combined with processing operations in order to yield efficient pipelined
parallel I/O and compute intensive programs. The underlying parallel system
is based on commodity components (PentiumPro processors, Fast Ethernet) and
runs on top of WindowsNT. The CAP-based parallel program development approach
is applied to the development of an I/O and processing intensive tomographic
3D image visualization application. Configurations range from a single
PentiumPro 1-disk system to a four PentiumPro 27-disk system. We show that
performances scale well when increasing the number of processors and disks.
With the largest configuration, the system is able to extract in parallel and
project into the display space between three and four 512x512 images per
second. The images may have any orientation and are extracted from a 100
MByte 3D tomographic image striped over the available set of disks.
Keywords: parallel computing, parallel I/O,
parallel I/O application, image processing, pario-bib
Comment: See also messerli:jimage, gennart:CAP,
vetsch:visiblehuman, messerli:thesis.
Keywords: multiprocessor architecture, compiler,
parallel I/O, pario-bib
Comment: Includes some comments by Randy Katz
about parallel I/O, in particular, distinguishing between ``fat'' nodes (with
many disks, e.g., a RAID), and ``thin'' nodes (with one disk).
Keywords: multiprocessor I/O, I/O architecture,
distributed system, pario-bib
Comment: Advocates using dedicated server
processors for all I/O, e.g., disk server, terminal server, network server.
Pass I/O requests and data via messages or RPC calls over the interconnect
(here a shared bus). Server handles packaging, blocking, caching, errors,
interrupts, and so forth, freeing the main processors and the interconnect
from all this activity. Benefits: encapsulates I/O-related stuff in specific
places, accommodates heterogeneity, improves performance. Nice idea, but
allows for an I/O bottleneck, unless server can handle all the demand.
Otherwise would need multiple servers, more expensive than just multiple
controllers.
Keywords: file access pattern, supercomputer, disk
caching, prefetching, pario-bib
Comment: Same as miller:iobehave-tr except without
the appendix outlining trace format. Included in pario-bibliography not
because it measures a parallel workload, but because it is so often cited in
the parallel-IO community.
Abstract: Modern massively parallel file systems
provide high bandwidth file access by striping files across arrays of disks
attached to a few specialized I/O nodes. However, these file systems are hard
to use and difficult to integrate with workstations and tertiary storage.
RAMA addresses these problems by providing a high-performance massively
parallel file system with a simple interface. RAMA uses hashing to
pseudo-randomly distribute data to all of its disks, ensuring high bandwidth
regardless of access pattern and eliminating bottlenecks in file block
accesses. This flexibility does not cause a large loss of performance -
RAMA's simulated performance is within 10-15% of the optimum performance of
a similarly-sized striped file system, and is a factor of 4 or more better
than a striped file system with poorly laid out data.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: They use parallel disks of a multiprocessor as a set-associative
cache for tertiary storage. Each "disk line" contains a set of blocks, and a
little index that lists the blocks contained in the disk line. To access
block b of a file, you hash on b/s, where s is a small factor like 4; that
encourages consecutive blocks to land in the same disk line, for better
locality. That gives you the disk line number. From that you compute the disk
number, and the node number. Send a message to that node. It reads through
the index for that disk line to find the block within the line. Metadata like
file permissions are stored in the disk line with the first block of the
file. Part of the paper deals with file-system integrity; no fsck is needed.
When RAMA goes to tertiary storage, it reads a large batch of the file, but
need not read the entire file into disk cache. Dirty data are flushed back to
tertiary store periodically. They use simulation to study performance with
synthetic access patterns. Unfortunately they simulated rather small files
and patterns. The paper talks quite a bit about disk (space and bandwidth)
utilization, and network bandwidth utilization. One of the big benefits of
this hash-based approach is that it tends to distribute the traffic to the
network and to the disks very evenly, even under highly regular access
patterns that might unbalance a traditional striped approach. Finally, they
claim to do well on small-file workloads as well as supercomputer workloads.
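The hashed placement just described can be written down directly; a schematic of the idea, with the sequentiality factor s and the line/disk/node geometry made up for illustration:

  # Hedged sketch of RAMA-style placement: hash (file id, block // s) to a
  # disk line, then derive the disk and node from the line number.  s > 1
  # keeps nearby blocks of a file in the same line.
  import hashlib

  def place(file_id, block, s, lines_per_disk, disks_per_node, num_nodes):
      total_disks = disks_per_node * num_nodes
      key = f"{file_id}:{block // s}".encode()
      h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
      line = h % (lines_per_disk * total_disks)
      disk = line // lines_per_disk
      node = disk // disks_per_node
      return node, disk, line % lines_per_disk

  # Example: consecutive blocks of file 42 share a line in groups of s = 4.
  for b in range(8):
      print(b, place(42, b, s=4, lines_per_disk=128, disks_per_node=8, num_nodes=16))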
Keywords: parallel file system, parallel I/O,
database, pario-bib
Comment: This is really for databases. They
identify two types of file access: one where the file can be operated on as a
set of subfiles, each independently by a processor (what they call MIMD
mode), and another where the file must be operated on with a centralized
control (SIMD mode), in their case to search a B-tree whose nodes span the
set of processors. Basically it is a host connected to a controller, that is
connected to a set of small I/O processors, each of which has access to disk.
In many ways a uniprocessor perspective. Paper design, with simulation
results.
Abstract: Data intensive computer applications
suffer from inadequate use of parallelism for processing data stored on
secondary storage devices. Devices such as database machines are useful in
some applications, but many applications are too small or specialized to use
such technology. To bridge this gap, the authors introduce the parallel
secondary storage (PASS) system. PASS is based on a network of
microcomputers. The individual microcomputers are assigned to a unit of
secondary storage and the operations of the microcomputers are initiated and
monitored by a control processor. The file system is capable of acting as
either an SIMD or an MIMD machine. Communication between the individual
microcomputers and the control processor is described. The integration of the
multiple microcomputers into the primitive operations on a file is examined.
Finally, the strategies employed to enhance performance in the
multiprogramming environment are discussed.
Keywords: parallel I/O, parallel file system,
multiprocessor file system, pario-bib
Keywords: parallel I/O, parallel file system,
pario-bib
Abstract: We discuss the results of a
collaborative project on parallel processing of Synthetic Aperture Radar
(SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the
California Institute of Technology (Caltech) and Intel Scalable Systems
Division (SSD). Through this collaborative effort, we have successfully
parallelized the most compute-intensive SAR correlator phase of the
Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the
Intel Paragon. We describe the data decomposition, the scalable
high-performance I/O model, and the node-level optimizations which enable us
to obtain efficient processing throughput. In particular, we point out an
interesting double level of parallelization arising in the data decomposition
which increases substantially our ability to support ``high volume'' SAR.
Results are presented from this code running in parallel on the Intel
Paragon. A representative set of SAR data, of size 800 Megabytes, which was
collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15
seconds, is processed in 55 seconds on the Concurrent Supercomputing
Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes
for the current SIR-C/X-SAR processing system at JPL. For the first time, a
commercial system can process SIR-C/X-SAR data at a rate which is approaching
the rate at which the SIR-C/X-SAR instrument can collect the data. This work
has successfully demonstrated the viability of the Intel Paragon
supercomputer for processing ``high volume'' Synthetic Aperture Radar data in
near real-time.
Keywords: parallel I/O, pario-bib
Comment: Available only on CD-ROM and WWW.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: The multiprocessor's file system acts as
a block cache for tertiary storage. Disk space is broken into ``lines'' of a
few MB. Each line has a descriptor telling what blocks it has, and their
status. (fileid, offset) hashed to find (disk, linenum). Intrinsic metadata
stored at start of each file; positional metadata implicit in hashing, and
line descriptors. Sequentiality parameter puts several blocks of a file in
the same line, to improve medium-sized requests (otherwise generate lots of
request-response net traffic). Not clear on best choice of size. No mention
of atomicity wrt concurrent writes to same data. Blocks migrate to tertiary
storage as they get old. Fetched on demand, by block (not file).
Self-describing blocks have ids in block - leads to screwy block sizes?
Keywords: parallel file system, pario-bib
Comment: Simulation results. RAMA distributes
blocks of each file randomly across disks, which are attached to all
processor nodes, using a hash function. Thus there is no centralized
metadata. The big benefit is uniform performance regardless of access
pattern; they found one situation where it was 10% slower than an optimal
striped layout, but many cases where they were as much as 4 times faster than
bad striped data layouts. So, they can give reasonable performance without
the need for programmer- or manager-specified data layouts.
Keywords: multiprocessor file system, pario-bib
Comment: A simple file system for a transputer
network, attached to a single disk device. Several procs are devoted to the
file system, but really just act as buffers for the host processor that runs
the disk. They provide sequential, random access, and indexed files, either
byte- or record-oriented. Some prototypes; no results. They add buffering and
double buffering, but don't really get into anything interesting.
Keywords: bibliography, parallel computing,
distributed computing, pario-bib
Comment: This reference is the original
publication of Eugene's annotated bibliography. It has grown tremendously and
is now huge. Because of the copyright considerations, you can't just nab it
off the net, but it is free for the asking from Eugene. Send mail to
eugene@nas.nasa.gov.
Abstract: In this paper, a generalized
input/output (I/O) data format and library for a module-based parallel finite
element analysis system are proposed. The module-based system consists of
pre-, main- and post-modules, as well as some common libraries. The present
I/O library, called ADVENTURE_IO, and data format are developed specifically
for use in parallel high-performance computational mechanics system. These
are rather simple compared to other general-purpose I/O systems such as
netCDF and HDF5. A simple container called a finite element generic
attributes (FEGAs) document enables the handling of almost all the I/O data
in a parallel finite element method code. Due to the simplicity of the
present system, tuning up the I/O library for a specific parallel environment
is easy. Other major features of the present system are: (1) it possesses a
generalized collaboration mechanism consisting of multiple modules in a
distributed computing environment employing common object request broker
architecture, and (2) abstracted data description employed in the
FEGA/HDDM_FEGA document enables the development of a unique domain decomposer
that can subdivide any kind of input data.
Keywords: data format, finite element method,
generalized I/O data, hierarchical domain decomposition, pario-app, pario-bib
Abstract: RAID5 disk arrays provide high
performance and high reliability for reasonable cost. However, RAID5 suffers a
performance penalty during block updates. We examine the feasibility of using
"dynamic parity striping" to improve the performance of block updates.
Instead of updating each block independently, this method buffers a number of
updates, generates a new stripe composed of the newly updated blocks, then
writes the full stripe back to disk. Two implementations are considered in
this paper. One is a log-structured file system (LFS) based method and the
other is Virtual Striping. Both methods achieve much higher performance than
conventional approaches. The performance characteristics of the LFS based
method and the Virtual Striping method are clarified.
Keywords: disk array, RAID, disk striping,
parallel I/O, pario-bib
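As a minimal sketch of the full-stripe idea (illustrative only, not either
implementation from the paper): buffer N updated blocks, XOR them into a parity
block, and write data plus parity as one new stripe instead of doing a
read-modify-write per block.

    # Illustrative full-stripe construction for dynamic parity striping.
    def build_stripe(blocks):
        """Given N equal-sized updated data blocks, return them plus one parity block."""
        size = len(blocks[0])
        assert all(len(b) == size for b in blocks)
        parity = bytearray(size)
        for blk in blocks:
            for i, byte in enumerate(blk):
                parity[i] ^= byte
        return blocks + [bytes(parity)]

    # Any lost data block can be rebuilt by XOR-ing the survivors with the parity.
    stripe = build_stripe([b"aaaa", b"bbbb", b"cccc"])
    print(stripe[-1])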
Keywords: parallel I/O, I/O, disk architecture,
disk striping, reliability, pario-bib
Comment: Commercially available: Micropolis
Systems' Parallel Disk 1800 series. Four disks plus one parity disk,
synchronized and byte-interleaved. SCSI interface. Total capacity 1.5 GBytes,
sustained transfer rate of 4 MBytes/s. MTTF 140,000 hours. Hard and soft
errors corrected in real-time. Failed drives can be replaced while system is
running.
Keywords: storage area network, pario-bib
Comment: Part of jin:io-book.
Keywords: RAID, parallel I/O, distributed file
system, transaction, pario-bib
Comment: See other Swift papers, e.g.,
cabrera:pario and long:swift-raid. This paper describes the basic idea of
using a transaction driver to implement RAID over a distributed system. Then
it spends most of the time describing the details of the implementation. The
basic idea is that processors execute transaction drivers, which provide
virtual CPUs to execute scripts of atomic 'instructions', where the
instructions are high-level things like read block, write block, compute
parity, etc. The transaction driver multiprocesses several scripts if
necessary. (Although they describe it in the context of a RAID implementation
it certainly could be used for other complex distributed services.) The
instructions are often transaction pairs, which compile into a pair of
instructions, one for this node and one for the remote node. This node sends
the program to the remote node, and they execute them separately, keeping
synchronized for transaction pairs when necessary. See also the newer paper
in Computing Surveys, long:swift-raid.
Abstract: Efficient storage and retrieval of
multi-attribute datasets have become one of the essential requirements for
many data-intensive applications. The Cartesian product file has been known
as an effective multi-attribute file structure for partial-match and
best-match queries. Several heuristic methods have been developed to
decluster Cartesian product files across multiple disks to obtain high
performance for disk accesses. Though the scalability of the declustering
methods becomes increasingly important for systems equipped with a large
number of disks, no analytic studies have been done so far. In this paper we
derive formulas describing the scalability of two popular declustering
methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most
common type of queries. These formulas disclose the limited scalability of
the declustering methods and are corroborated by extensive simulation
experiments. From the practical point of view, the formulas given in this
paper provide a simple measure which can be used to predict the response time
of a given range query and to guide the selection of a declustering method
under various conditions.
Keywords: parallel I/O, parallel database,
declustering, pario-bib
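For reference, the two declustering rules as they are commonly stated in the
literature, mapping a Cartesian product file bucket with field indices
(i1, ..., ik) onto one of M disks (an illustration of the rules, not code from
the paper):

    # Disk Modulo (DM) and Fieldwise Xor (FX) declustering of a bucket
    # with field indices (i1, ..., ik) over M disks.
    from functools import reduce

    def disk_modulo(indices, M):
        return sum(indices) % M

    def fieldwise_xor(indices, M):
        return reduce(lambda a, b: a ^ b, indices) % M

    bucket = (3, 5, 6)
    print(disk_modulo(bucket, 8), fieldwise_xor(bucket, 8))   # 6 0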
Keywords: parallel I/O, multiprocessor file
system, interprocessor communication, pario-bib
Comment: They propose several enhancements to
disk-directed I/O (see kotz:diskdir) that aim to improve performance on
fine-grained distributions, that is, where each block from the disk is broken
into small pieces that are scattered among the compute processors. One
enhancement combines multiple pieces, possibly from separate disk blocks,
into a single message. Another is to use two-phase I/O (see
delrosario:two-phase), but to use disk-directed I/O to read data from the
disks into CP memories, efficiently, then permute. This latter technique is
probably faster than normal two-phase I/O that uses a traditional file
system, not disk-directed I/O, for the read.
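As an illustrative sketch of the message-combining enhancement (hypothetical
names, not the authors' code): pieces destined for the same compute processor,
possibly cut from several disk blocks, are gathered into one message each
rather than being sent individually.

    # Coalesce small (destination CP, file offset, data) pieces into one message per CP.
    from collections import defaultdict

    def coalesce(pieces):
        outgoing = defaultdict(list)
        for dest, offset, data in pieces:
            outgoing[dest].append((offset, data))
        # One network message per destination instead of one per piece.
        return {dest: sorted(chunks) for dest, chunks in outgoing.items()}

    pieces = [(0, 0, b"a"), (1, 8, b"b"), (0, 16, b"c"), (1, 24, b"d")]
    print(coalesce(pieces))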
Abstract: Parallel languages rarely specify
parallel I/O constructs, and existing commercial systems provide the
programmer with a low-level I/O interface. We present design principles for
integrating I/O into languages and show how these principles are applied to a
virtual-processor-oriented language. We illustrate how machine-independent
modes are used to support both high performance and generality. We describe
an automatic mode detection technique that saves the programmer from extra
syntax and low-level file system details. We show how virtual processor file
operations, typically small by themselves, are combined into efficient
large-scale file system calls. Finally, we present a variety of benchmark
results detailing design tradeoffs and the performance of various modes.
Keywords: parallel I/O, data parallelism,
pario-bib
Comment: Updated version of TR 95-80-9. See
moore:stream. Interesting approach, where they permit a fairly normal fread
and fwrite kind of interface, with each VP having its own stream. They choose
their own format for the file, and switch between formats (and internal
buffering) depending on the particulars of the fread and fwrite parameters.
They seem to have good performance, and a familiar interface. They are left
with a non-standard file format.
Abstract: Brief descriptions of the I/O
requirements for four production oceanography programs running at Oregon
State University are presented. The applications all rely exclusively on
array-oriented, sequential file operations. Persistent files are used for
checkpointing and movie making, while temporary files are used to store
out-of-core data.
Keywords: data parallel, file system workload,
parallel I/O, pario-bib
Comment: See moore:detection, moore:stream. Only
three pages.
Keywords: data parallel, parallel I/O, pario-bib
Abstract: Although hardware supporting parallel
file I/O has improved greatly since the introduction of first-generation
parallel computers, the programming interface has not. Each vendor provides a
different logical view of parallel files as well as nonportable operations
for manipulating files. Neither do parallel languages provide standards for
performing I/O. In this paper, we describe a view of parallel files for
data-parallel languages, dubbed Stream*, in which each virtual processor
writes to and reads from its own stream. In this scheme each virtual
processor's I/O operations have the same familiar, unambiguous meaning as in
a sequential C program. We demonstrate how I/O operations in Stream* can run
as fast as those of vendor-specific parallel file systems on the operations
most often encountered in data-parallel programs. We show how this system
supports general virtual processor operations for debugging and elemental
functions. Finally, we present empirical results from a prototype Stream*
system running on a Meiko CS-2 multicomputer.
Keywords: data parallel, parallel I/O, pario-bib
Comment: See moore:stream; nearly identical. See
also moore:detection. This paper gives a little bit earlier description of
the Stream* idea than does moore:detection, but you'd be pretty much complete
just reading moore:detection.
Keywords: manufacturing, integrated chip, parallel
I/O, pario-bib
Comment: They describe "IMaD", a parallel code
that used to support product engineering of full-scale integrated circuits.
The code itself simulates the entire integrated circuit to address three
primary apects of product engineering: to assure the an IC is manufacturable,
to monitor its lifetime yeild and reliability, and to support IC test and
failure analysis. The simulation is computationally, memory and I/O
intensive. While the paper primarily describes the model and the simulation
equations, the talk addressed the issue of parallel I/O, where the data for
each processor was written to a separate disk. Not exactly a novel approach,
but it emphasises the fact that the I/O requirements are large enough that
they used an approach other than a standard serial method.
Abstract: This paper presents the design and
evaluation of a multi-threaded runtime library for parallel I/O. We extend
the multi-threading concept to separate the compute and I/O tasks in two
separate threads of control. Multi-threading in our design permits a)
asynchronous I/O even if the underlying file system does not support
asynchronous I/O; b) copy avoidance from the I/O thread to the compute thread
by sharing address space; and c) a capability to perform collective I/O
asynchronously without blocking the compute threads. Further, this paper
presents techniques for collective I/O which maximize load balance and
concurrency while reducing communication overhead in an integrated fashion.
Performance results on IBM SP2 for various data distributions and access
patterns are presented. The results show that there is a tradeoff between the
amount of concurrency in I/O and the buffer size designated for I/O; and
there is an optimal buffer size beyond which benefits of larger requests
diminish due to large communication overheads.
Keywords: threads, parallel I/O, pario-bib
Keywords: parallel I/O, disk architecture,
pario-bib
Comment: A short paper on some basic techniques
used by disk controllers to improve throughput: seek optimization, request
combining, request queuing, using multiple drives in parallel, scatter/gather
DMA, data caching, read-ahead, cross-track read-ahead, write-back caching,
segmented caching, reduced latency (track buffering), and format skewing.
[Most of these are already handled in Unix file systems.]
Keywords: parallel I/O, disk array, pario-bib,
RAID
Comment: Transaction-processing workload dominated
by small I/Os. They compare RAID 5, Parity Striping (which was designed for
TP because it avoids lots of seeks on medium-sized requests, by declustering
parity but not data), mirroring, and RAID 0. RAID 5 does better than
parity striping due to its load balancing ability on the skewed workload.
RAID 5 is also better as the load increases.
Abstract: Current operating systems offer poor
performance when a numeric application's working set does not fit in main
memory. As a result, programmers who wish to solve ``out-of-core'' problems
efficiently are typically faced with the onerous task of rewriting an
application to use explicit I/O operations (e.g., read/write). In this paper,
we propose and evaluate a fully-automatic technique which liberates the
programmer from this task, provides high performance, and requires only
minimal changes to current operating systems. In our scheme, the compiler
provides the crucial information on future access patterns without burdening
the programmer, the operating system supports non-binding prefetch and
release hints for managing I/O, and the operating system cooperates with a
run-time layer to accelerate performance by adapting to dynamic behavior and
minimizing prefetch overhead. This approach maintains the abstraction of
unlimited virtual memory for the programmer, gives the compiler the
flexibility to aggressively move prefetches back ahead of references, and
gives the operating system the flexibility to arbitrate between the competing
resource demands of multiple applications. We have implemented our scheme
using the SUIF compiler and the Hurricane operating system. Our experimental
results demonstrate that our fully-automatic scheme effectively hides the I/O
latency in out-of-core versions of the entire NAS Parallel benchmark suite,
thus resulting in speedups of roughly twofold for five of the eight
applications, with two applications speeding up by threefold or more.
Keywords: compiler, prefetch, parallel I/O,
pario-bib
Comment: Best Paper Award.
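The scheme itself is compiler- and OS-level, but the hint pattern it relies on
can be mimicked at application level; as a loose, hedged analogy only (not the
SUIF/Hurricane mechanism), a sequential pass over a large out-of-core file can
issue non-binding prefetch and release hints with posix_fadvise on POSIX
systems:

    # Loose analogy: advisory prefetch (WILLNEED) and release (DONTNEED) hints
    # issued around a sequential pass over a large out-of-core file.
    import os

    CHUNK = 8 << 20   # bytes processed per step

    def hint(fd, offset, length, advice_name):
        advice = getattr(os, advice_name, None)           # POSIX-only constants
        if advice is not None and hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, length, advice)  # non-binding hint

    def stream(path):
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            for off in range(0, size, CHUNK):
                if off + CHUNK < size:                    # prefetch the next chunk
                    hint(fd, off + CHUNK, CHUNK, "POSIX_FADV_WILLNEED")
                data = os.pread(fd, min(CHUNK, size - off), off)  # compute on data
                hint(fd, off, len(data), "POSIX_FADV_DONTNEED")   # release it
        finally:
            os.close(fd)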
Keywords: parallel I/O, pario-bib
Comment: An overview of PIOUS and its performance.
Results for partitioned and self-scheduled access pattern. See other moyer:*
papers. The big thing about PIOUS over previous parallel file systems is its
internal use of transactions for concurrency control and user-selectable
fault-tolerance guarantees, and its optional support of user-level
transactions.
Abstract: Parallel file systems employ data
declustering to increase I/O throughput. But because a single read or write
operation can generate data accesses on multiple independent storage devices,
a concurrency control mechanism must be employed to retain familiar file
access semantics. Concurrency control negates some of the performance
benefits of data declustering by introducing additional file access overhead.
This paper examines the performance characteristics of the transaction-based
concurrency control mechanism implemented in the PIOUS parallel file system.
Results demonstrate that linearizability of file access operations is
provided without loss of scalability or stability.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: ``substantially different material than
presented in a previous report,'' moyer:scalable-tr. But it seems like the
moyer:scalable IOPADS paper is largely a subset of this TR. He describes how
they use volatile transactions, and does some experiments with PIOUS to
measure their efficiency. Basically, they use a 2-phase commit protocol,
using timeouts to detect deadlock and transaction aborts to remedy the
deadlock. Results for partitioned and sequential access patterns.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Keywords: parallel I/O, parallel file system,
workstation cluster, file system interface, pario-bib
Comment: See moyer:pious. A further description of
the PIOUS parallel file system for cluster computing. (Beta-test version
available for ftp). They support parafiles, which are collections of
segments, each segment residing on a different server. The segments can be
viewed separately or can be interleaved into a linear sequence using an
arbitrary chunk size. They also support transactions to support sequential
consistency.
Keywords: parallel I/O, parallel file system,
workstation cluster, file system interface, pario-bib
Comment: Basically, I/O for clusters of
workstations; ideally, it is parallel, heterogeneous, fault tolerant, etc.
File servers are independent, have only a local view. Single server used to
coordinate open(). Client libraries implement the API and depend on the
servers only for storage mechanism. Servers use transactions internally -
but usually these are lightweight transactions, only used for concurrency
control and not recovery. Full transactions are supported for times when the
user wants the extra fault tolerance. They have files that are in some sense
2-dimensional. Sequential consistency. User-controllable fault tolerance.
Performance: 2 clients max out the transport (ethernet). ``Stable'' mode is
slow, as is self-scheduled mode. No client caching. See moyer:pario.
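A small sketch of the interleaving arithmetic implied by the linear view
(illustrative names, not the PIOUS API): a byte offset in the interleaved view
maps to a (segment, offset-within-segment) pair, given the number of segments
and the chunk size.

    # Map a linear parafile offset onto (segment, offset within segment),
    # with segments interleaved round-robin in units of `chunk` bytes.
    def locate(offset, nsegments, chunk):
        stripe, within = divmod(offset, chunk)     # which chunk, and where inside it
        segment = stripe % nsegments               # round-robin over segments
        seg_offset = (stripe // nsegments) * chunk + within
        return segment, seg_offset

    # With 4 segments and 1 KB chunks, byte 5000 falls in chunk 4 -> segment 0.
    print(locate(5000, nsegments=4, chunk=1024))   # (0, 1928)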
Abstract: Parallel file systems employ data
declustering to increase I/O throughput. As a result, a single read or write
operation can generate concurrent data accesses on multiple storage devices.
Unless a concurrency control mechanism is employed, familiar file access
semantics are likely to be violated. This paper details the transaction-based
concurrency control mechanism implemented in the PIOUS parallel file system.
Performance results are presented demonstrating that sequential consistency
semantics can be provided without loss of system scalability.
Keywords: parallel I/O, pario-bib
Comment: Seems to be a subset of
moyer:scalable-tr, and for that matter, moyer:characterize. Results for
partitioned access pattern.
Abstract: Parallel file systems employ data
declustering to increase I/O throughput. As a result, a single read or
write operation can generate concurrent data accesses on multiple storage
devices. Unless a concurrency control mechanism is employed, familiar file
access semantics are likely to be violated. This paper details the
transaction-based concurrency control mechanism implemented in the PIOUS
parallel file system. Performance results are presented demonstrating that
sequential consistency semantics can be provided without loss of system
scalability.
Keywords: parallel I/O, parallel file system,
concurrency control, synchronization, transaction, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: Parallel file systems employ data
declustering to increase I/O throughput. As a result, a single read or write
operation can generate concurrent data accesses on multiple storage devices.
Unless a concurrency control mechanism is employed, familiar file access
semantics are likely to be violated. This paper details the transaction-based
concurrency control mechanism implemented in the PIOUS parallel file system.
Performance results are presented demonstrating that sequential consistency
semantics can be provided without loss of system scalability.
Keywords: parallel I/O, parallel file system,
pario-bib
Comment: They describe volatile
transactions as a way of providing the appropriate sequential consistency
among file-read and -write operations (a feature not provided by most file
systems). Their PIOUS library implements these transactions with strict
2-phase locking. They show some performance results, though only on a limited
and relatively simple benchmark. If nothing else this paper reminds us all
that atomicity of file-read and -write requests should be available to the
user (e.g., note how they are optional in Vesta). Published as moyer:scalable.
Keywords: parallel I/O, message-passing,
multiprocessor file system interface, pario-bib
Comment: This is the definition of the MPI2
message-passing standard, which includes an interface for parallel I/O.
Supersedes mpi-ioc:mpi-io5 and earlier versions. See the MPI2 web page at
http://www.mpi-forum.org. The I/O section is at
http://www.mpi-forum.org/docs/mpi-20-html/node172.html.
Keywords: parallel I/O, message-passing,
multiprocessor file system interface, pario-bib
Comment: Supersedes corbett:mpi-io4 and earlier
versions. See the MPI-IO Web page at http://parallel.nas.nasa.gov/MPI-IO/.
Keywords: MPI, message passing, parallel
computing, library, parallel I/O, pario-bib
Comment: Chapter 9 is about I/O extensions.
Abstract: The development and evaluation of a
tuple set manager (TSM) based on multikey index data structures is a main
part of the PARABASE project at the University of Vienna. The TSM provides
access to parallel mass storage systems using tuple sets instead of
conventional files as the central data structure for application programs. A
proof-of-concept prototype TSM is already implemented and operational on an
iPSC/2. It supports tuple insert and delete operations as well as exact
match, partial match, and range queries at the system-call level. Available
results come from this prototype on the one hand, and from various performance
evaluation figures on the other. The evaluation results demonstrate the performance gain
achieved by the implementation of the tuple set management concept on a
parallel mass storage system.
Keywords: parallel database, mass storage,
parallel I/O, pario-bib
Keywords: file system, disk striping, disk
mirroring, pario-bib
Keywords: disk array, parallel, performance
analysis, pario-bib
Comment: Looked at RAID5 when in failure mode. For
small-reads workload, could only get 50% of normal. So they decouple cluster
size and parity-group size, so that they decluster over more disks than group
size; during failure, this causes less of a load increase on surviving disks.
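As a back-of-the-envelope illustration of why decoupling helps (illustrative
arithmetic, not the paper's model): if parity groups of size G are spread
evenly over C >= G disks, each surviving disk sees roughly an extra
(G-1)/(C-1) of its normal read load during reconstruction, instead of doubling
when C = G.

    # Extra per-disk read load while rebuilding one failed disk, when parity
    # groups of size G are declustered over C disks.
    def rebuild_load_increase(G, C):
        return (G - 1) / (C - 1)

    print(rebuild_load_increase(G=5, C=5))    # 1.0 -> surviving disks double their reads
    print(rebuild_load_increase(G=5, C=21))   # 0.2 -> only 20% extra load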
Keywords: parallel I/O, multimedia, databases,
pario-bib
Comment: Introduction to a special issue.
Keywords: distributed file system, multiprocessor
file system, pario-bib
Comment: See also broom:acacia, broom:impl,
lautenbach:pfs, and broom:cap. They examine ways to manage a distributed file
cache, without replication. Since there is no replication, the concurrency
control problems boil down to providing atomicity for multi-block, multi-site
requests. This is handled essentially by serializing the request: send the
request to the first site, and have it forward the request from site to site
as each block is processed. This works fine but completely serializes all
multi-block requests, somewhat defeating the purpose. Thus, they get
concurrency between requests, by having multiple servers, but no parallelism
within requests.
Keywords: buffering, file caching, tertiary
storage, tape robot, file migration, parallel I/O, pario-bib
Comment: Ways to use secondary and tertiary
storage in parallel, and buffering mechanisms for applications with
concurrent I/O requirements.
Keywords: multiprocessor file system, pario-bib
Comment: They describe a file system for general
message-passing, distributed-memory, separate I/O and compute node,
multicomputers. They provide few details, although they cite a lot of their
tech reports. There are a few simulation results, but none show anything
unintuitive.
Keywords: parallel I/O, pario-bib
Comment: Using parallel I/O channels to access
striped disks, in parallel from a supercomputer. They chain (i.e.,
combine) requests to a disk for large contiguous accesses.
Keywords: collective I/O, multiprocessor file
system, parallel I/O, pario-bib
Abstract: JUMP-1 is a distributed shared-memory
massively parallel computer composed of multiple clusters connected by an
interconnection network called RDT (Recursive Diagonal Torus). Each cluster in
JUMP-1 consists of 4 element processors, secondary cache memories, and 2 MBP
(Memory Based Processor) for high-speed synchronization and communication
among clusters. The I/O subsystem is connected to a cluster via a high-speed
serial link called STAFF-Link. The I/O buffer memory is mapped onto the
JUMP-1 global shared memory so that each I/O access operation can be performed
as a memory access. In this paper we describe an evaluation of the fundamental
performance of the disk I/O subsystem using event-driven simulation, and
estimate its performance with a Video On Demand (VOD) application.
Keywords: parallel I/O, I/O architecture,
pario-bib
Keywords: parallel I/O, pario-bib
Abstract: This paper presents a measurement and
simulation based study of parallel I/O in a high-performance cluster system:
the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The
measurements were used to characterize the performance bottlenecks and the
throughput limits at the compute and I/O nodes, and to provide realistic
input parameters to PioSim, a simulation environment we have developed to
investigate parallel I/O performance issues in cluster systems. PioSim was
used to obtain a detailed characterization of parallel I/O performance, in
the high performance cluster system, for different regular access patterns
and different system configurations. This paper also explores the use of
local disks at the compute nodes for parallel I/O, and finds that the local
disk architecture outperforms the traditional parallel I/O over remote I/O
node disks architecture, even when as much as 68-75% of the requests from
each compute node go to remote disks.
Keywords: performance analysis, parallel I/O,
pario-bib
Keywords: multiprocessor architecture, MIMD,
parallel I/O, pario-bib
Comment: Has 1-32 50MHz Intel 486 processors.
Parallel independent disks on the disk nodes, separate from the processor
nodes. Tree interconnect. Aimed at database applications.
Keywords: parallel I/O, disk array, pario-bib
Comment: Discusses disk arrays and striping.
Transfer size is important to striping success: small size transfers are
better off with independent disks. Synchronized rotation is especially
important for small transfer sizes, since then the increased rotational
delays dominate. Fine grain striping involves less assembly/disassembly
delay, but coarse grain (block) striping allows for request parallelism. Fine
grain striping wastes capacity due to fixed size formatting overhead. He also
derives exact MTTF equation for 1-failure tolerance and on-line repair.
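For reference, the standard first-order form of such an equation (the paper
derives an exact version; this is only the usual approximation), for n_G groups
of G data disks plus one check disk each, per-disk mean time to failure
MTTF_disk, and mean time to repair MTTR:

    MTTF_array \approx \frac{(MTTF_{disk})^2}{n_G \, G \, (G + 1) \, MTTR}

The approximation assumes independent failures and an MTTR much smaller than
MTTF_disk.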
Keywords: parallel I/O, disk architecture, disk
caching, I/O bottleneck, pario-bib
Comment: Compares four different ways of
restructuring IBM disk controllers and channels to obtain more parallelism.
They use parallel heads or parallel actuators. The best results come when
they replicate the control electronics to maintain the number of data paths
through the controller. Otherwise the controller bottleneck reduces
performance. Generally, for large or small transfer sizes, parallel heads
with replication gave better performance.
Keywords: out-of-core algorithm, parallel I/O
algorithm, scientific computing, vector computer, pario-bib
Comment: They describe an out-of-core FFT
algorithm for vector computers (one disk, one vector processor). They
implemented it on a Convex and show good performance. Basically, they segment
the array, do FFTs on each segment, and do some transposing and other work
to combine the segments. Each segment is essentially one memory load. Seems
parallelizable too.
Keywords: Unix, parallel I/O, data parallel,
pario-bib
Comment: Cite nickolls:maspar-io.
Abstract: Scalable parallel computers require I/O
balanced with computational power to solve data-intensive problems.
Distributed memory architectures call for I/O hardware and software beyond
those of conventional scalar systems. This paper introduces the MasPar I/O
system, designed to provide balanced and scalable data-parallel Unix I/O. The
architecture and implementation of the I/O hardware and software are
described. Key elements include parallel access to conventional Unix file
descriptors and a self-routing multistage network coupled with a buffer
memory for flexible parallel I/O transfers. Performance measurements are
presented for parallel Unix I/O with a scalable RAID disk array, a RAM disk,
and a HIPPI interconnect.
Keywords: parallel I/O, multiprocessor file
system, SIMD, pario-bib
Comment: This provides the definitive reference
for the Maspar parallel-I/O architecture and file system. This paper includes
a brief discussion of the interface and performance results. Also includes
some HIPPI interface performance results. This paper is the conference
version of nickolls:dpio, so cite this one.
Abstract: In out-of-core computations, disk
storage is treated as another level in the memory hierarchy, below cache,
local memory, and (in a parallel computer) remote memories. However the tools
used to manage this storage are typically quite different from those used to
manage access to local and remote memory. This disparity complicates
implementation of out-of-core algorithms and hinders portability. We describe
a programming model that addresses this problem. This model allows parallel
programs to use essentially the same mechanisms to manage the movement of
data between any two adjacent levels in a hierarchical memory system. We take
as our starting point the Global Arrays shared-memory model and library,
which support a variety of operations on distributed arrays, including
transfer between local and remote memories. We show how this model can be
extended to support explicit transfer between global memory and secondary
storage, and we define a Disk Resident Arrays Library that supports such
transfers. We illustrate the utility of the resulting model with two
applications, an out-of-core matrix multiplication and a large computational
chemistry program. We also describe implementation techniques on several
parallel computers and present experimental results that demonstrate that the
Disk Resident Arrays model can be implemented very efficiently on parallel
computers.
Keywords: parallel I/O, pario-bib
Abstract: Recent developments in I/O systems on
scalable parallel computers have sparked renewed interest in out-of-core
methods for computational chemistry. These methods can improve execution time
significantly relative to "direct" methods, which perform many redundant
computations. However, the widespread use of such out-of-core methods
requires efficient and portable implementations of often complex I/O
patterns. The ChemIO project has addressed this problem by defining an I/O
interface that captures the I/O patterns found in important computational
chemistry applications and by providing high-performance implementations of
this interface on multiple platforms. This development not only broadens the
user community for parallel I/O techniques but also provides new insights
into the functionality required in general-purpose scalable I/O libraries and
the techniques required to achieve high performance I/O on scalable parallel
computers.
Keywords: parallel I/O application, pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Abstract: We propose a new parallel, noncollective
I/O strategy called Distant I/O that targets clustered computer systems in
which disks are attached to compute nodes. Distant I/O allows one-sided
access to remote secondary storage without installing server processes or
daemons on remote compute nodes. We implemented this model using Active
Messages and demonstrated its performance advantages over the PIOFS parallel
filesystem for an I/O-intensive parallel application on the IBM SP.
Keywords: parallel I/O, pario-bib, remote I/O
Abstract: As the I/O needs of parallel scientific
applications increase, file systems for multiprocessors are being designed to
provide applications with parallel access to multiple disks. Many parallel
file systems present applications with a conventional Unix-like interface
that allows the application to access multiple disks transparently. This
interface conceals the parallelism within the file system, which increases
the ease of programmability, but makes it difficult or impossible for
sophisticated programmers and libraries to use knowledge about their I/O
needs to exploit that parallelism. Furthermore, most current parallel file
systems are optimized for a different workload than they are being asked to
support. We introduce Galley, a new parallel file system that is intended to
efficiently support realistic parallel workloads. We discuss Galley's file
structure and application interface, as well as an application that has been
implemented using that interface.
Keywords: parallel file system, parallel I/O,
multiprocessor file system interface, pario-bib, dfk
Comment: See also nieuwejaar:galley-perf. Also
available at
http://www.acm.org/pubs/citations/proceedings/supercomputing/237578/p374-nieuwejaar/
Abstract: As the I/O needs of parallel scientific
applications increase, file systems for multiprocessors are being designed to
provide applications with parallel access to multiple disks. Many parallel
file systems present applications with a conventional Unix-like interface
that allows the application to access multiple disks transparently. This
interface conceals the parallelism within the file system, which increases
the ease of programmability, but makes it difficult or impossible for
sophisticated programmers and libraries to use knowledge about their I/O
needs to exploit that parallelism. Furthermore, most current parallel file
systems are optimized for a different workload than they are being asked to
support. We introduce Galley, a new parallel file system that is intended to
efficiently support realistic parallel workloads. Initial experiments,
reported in this paper, indicate that Galley is capable of providing
high-performance I/O to applications that access data in patterns that have
been observed to be common.
Keywords: parallel file system, parallel I/O,
multiprocessor file system interface, pario-bib, dfk
Comment: See also nieuwejaar:galley.
Abstract: Most current multiprocessor file systems
are designed to use multiple disks in parallel, using the high aggregate
bandwidth to meet the growing I/O requirements of parallel scientific
applications. Many multiprocessor file systems provide applications with a
conventional Unix-like interface, allowing the application to access multiple
disks transparently. This interface conceals the parallelism within the file
system, increasing the ease of programmability, but making it difficult or
impossible for sophisticated programmers and libraries to use knowledge about
their I/O needs to exploit that parallelism. In addition to providing an
insufficient interface, most current multiprocessor file systems are
optimized for a different workload than they are being asked to support. We
introduce Galley, a new parallel file system that is intended to efficiently
support realistic scientific multiprocessor workloads. We discuss Galley's
file structure and application interface, as well as the performance
advantages offered by that interface.
Keywords: parallel file system, parallel I/O,
multiprocessor file system interface, pario-bib, dfk
Comment: A revised version of
nieuwejaar:jgalley-tr, which is a combination of nieuwejaar:galley and
nieuwejaar:galley-perf.
Abstract: Most current multiprocessor file systems
are designed to use multiple disks in parallel, using the high aggregate
bandwidth to meet the growing I/O requirements of parallel scientific
applications. Many multiprocessor file systems provide applications with a
conventional Unix-like interface, allowing the application to access multiple
disks transparently. This interface conceals the parallelism within the file
system, increasing the ease of programmability, but making it difficult or
impossible for sophisticated programmers and libraries to use knowledge about
their I/O needs to exploit that parallelism. In addition to providing an
insufficient interface, most current multiprocessor file systems are
optimized for a different workload than they are being asked to support. We
introduce Galley, a new parallel file system that is intended to efficiently
support realistic scientific multiprocessor workloads. We discuss Galley's
file structure and application interface, as well as the performance
advantages offered by that interface.
Keywords: parallel file system, parallel I/O,
multiprocessor file system interface, pario-bib, dfk
Abstract: As the I/O needs of parallel scientific
applications increase, file systems for multiprocessors are being designed to
provide applications with parallel access to multiple disks. Many parallel
file systems present applications with a conventional Unix-like interface
that allows the application to access multiple disks transparently. By
tracing all the activity of a parallel file system in a production,
scientific computing environment, we show that many applications exhibit
highly regular, but non-consecutive I/O access patterns. Since the
conventional interface does not provide an efficient method of describing
these patterns, we present an extension which supports strided and
nested-strided I/O requests.
Keywords: parallel I/O, multiprocessor file
system, pario-bib, dfk
Abstract: As the I/O needs of parallel scientific
applications increase, file systems for multiprocessors are being designed to
provide applications with parallel access to multiple disks. Many parallel
file systems present applications with a conventional Unix-like interface
that allows the application to access multiple disks transparently. By
tracing all the activity of a parallel file system in a production,
scientific computing environment, we show that many applications exhibit
highly regular, but non-consecutive I/O access patterns. Since the
conventional interface does not provide an efficient method of describing
these patterns, we present three extensions to the interface that support
strided, nested-strided, and nested-batched I/O requests.
We show how these extensions can be used to express common access patterns.
Keywords: parallel I/O, multiprocessor file
system, pario-bib, dfk
Comment: Identical to revised TR95-253,
nieuwejaar:strided2-tr. Cite nieuwejaar:strided2-book.
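To make the extensions concrete, a small illustration (hypothetical helper, not
the actual interface) of how a nested-strided request, given as a list of
(stride, count) levels over a base offset and an innermost record size, denotes
a set of contiguous (offset, length) pieces:

    # Expand a nested-strided request into its contiguous (offset, length) pieces.
    # levels = [(outer_stride, outer_count), ..., (inner_stride, inner_count)];
    # a simple strided request is just a single level.
    def expand(base, record_size, levels):
        offsets = [base]
        for stride, count in levels:                           # outermost level first
            offsets = [o + i * stride for o in offsets for i in range(count)]
        return [(o, record_size) for o in offsets]

    # A 2x3 sub-block of 8-byte records out of a row-major 10x10 array of records.
    print(expand(base=0, record_size=8, levels=[(80, 2), (8, 3)]))
    # [(0, 8), (8, 8), (16, 8), (80, 8), (88, 8), (96, 8)]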
Abstract: As the I/O needs of parallel scientific
applications increase, file systems for multiprocessors are being designed to
provide applications with parallel access to multiple disks. Many parallel
file systems present applications with a conventional Unix-like interface
that allows the application to access multiple disks transparently. By
tracing all the activity of a parallel file system in a production,
scientific computing environment, we show that many applications exhibit
highly regular, but non-consecutive I/O access patterns. Since the
conventional interface does not provide an efficient method of describing
these patterns, we present three extensions to the interface that support
strided, nested-strided, and nested-batched I/O requests.
We show how these extensions can be used to express common access patterns.
Keywords: parallel I/O, multiprocessor file
system, pario-bib, dfk
Comment: Part of a whole book on parallel I/O; see
iopads-book and nieuwejaar:strided2 (which is not much different).
Abstract: As the I/O needs of parallel scientific
applications increase, file systems for multiprocessors are being designed to
provide applications with parallel access to multiple disks. Many parallel
file systems present applications with a conventional Unix-like interface
that allows the application to access multiple disks transparently. By
tracing all the activity of a parallel file system in a production,
scientific computing environment, we show that many applications exhibit
highly regular, but non-consecutive I/O access patterns. Since the
conventional interface does not provide an efficient method of describing
these patterns, we present three extensions to the interface that support
strided, nested-strided, and nested-batched I/O requests.
We show how these extensions can be used to express common access patterns.
Keywords: parallel I/O, multiprocessor file
system, pario-bib, dfk
Comment: After revision, identical to
nieuwejaar:strided2.
Abstract: Most current multiprocessor file systems
are designed to use multiple disks in parallel, using the high aggregate
bandwidth to meet the growing I/O requirements of parallel scientific
applications. Most multiprocessor file systems provide applications with a
conventional Unix-like interface, allowing the application to access those
multiple disks transparently. This interface conceals the parallelism within
the file system, increasing the ease of programmability, but making it
difficult or impossible for sophisticated application and library programmers
to use knowledge about their I/O to exploit that parallelism. In addition to
providing an insufficient interface, most current multiprocessor file systems
are optimized for a different workload than they are being asked to support.
In this work we examine current multiprocessor file systems, as well as
how those file systems are used by scientific applications. Contrary to the
expectations of the designers of current parallel file systems, the workloads
on those systems are dominated by requests to read and write small pieces of
data. Furthermore, rather than being accessed sequentially and contiguously,
as in uniprocessor and supercomputer workloads, files in multiprocessor file
systems are accessed in regular, structured, but non-contiguous patterns.
Based on our observations of multiprocessor workloads, we have designed
Galley, a new parallel file system that is intended to efficiently support
realistic scientific multiprocessor workloads. In this work, we introduce
Galley and discuss its design and implementation. We describe Galley's new
three-dimensional file structure and discuss how that structure can be used
by parallel applications to achieve higher performance. We introduce several
new data-access interfaces, which allow applications to explicitly describe
the regular access patterns we found to be common in parallel file system
workloads. We show how these new interfaces allow parallel applications to
achieve tremendous increases in I/O performance. Finally, we discuss how
Galley's new file structure and data-access interfaces can be useful in
practice.
Keywords: parallel I/O, multiprocessor file
system, file system workload characterization, file access patterns, file
system interface, pario-bib
Abstract: Phenomenal improvements in the
computational performance of multiprocessors have not been matched by
comparable gains in I/O system performance. This imbalance has resulted in
I/O becoming a significant bottleneck for many scientific applications. One
key to overcoming this bottleneck is improving the performance of
multiprocessor file systems. The design of a high-performance
multiprocessor file system requires a comprehensive understanding of the
expected workload. Unfortunately, until recently, no general workload studies
of multiprocessor file systems have been conducted. The goal of the CHARISMA
project was to remedy this problem by characterizing the behavior of several
production workloads, on different machines, at the level of individual reads
and writes. The first set of results from the CHARISMA project describe the
workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This
paper is intended to compare and contrast these two workloads for an
understanding of their essential similarities and differences, isolating
common trends and platform-dependent variances. Using this comparison, we are
able to gain more insight into the general principles that should guide
multiprocessor file-system design.
Keywords: parallel I/O, file system workload,
workload characterization, file access pattern, multiprocessor file system,
dfk, pario-bib
Comment: See also kotz:workload,
nieuwejaar:strided, ap:workload.
Abstract: Phenomenal improvements in the
computational performance of multiprocessors have not been matched by
comparable gains in I/O system performance. This imbalance has resulted in
I/O becoming a significant bottleneck for many scientific applications. One
key to overcoming this bottleneck is improving the performance of parallel
file systems. The design of a high-performance parallel file system
requires a comprehensive understanding of the expected workload.
Unfortunately, until recently, no general workload studies of parallel file
systems have been conducted. The goal of the CHARISMA project was to remedy
this problem by characterizing the behavior of several production workloads,
on different machines, at the level of individual reads and writes. The first
set of results from the CHARISMA project describe the workloads observed on
an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to
compare and contrast these two workloads for an understanding of their
essential similarities and differences, isolating common trends and
platform-dependent variances. Using this comparison, we are able to gain more
insight into the general principles that should guide parallel file-system
design.
Keywords: parallel I/O, file system workload,
workload characterization, file access pattern, multiprocessor file system,
dfk, pario-bib
Comment: See also nieuwejaar:strided, ap:workload.
Keywords: parallel file systems, parallel I/O,
pario-bib
Comment: From the abstract, it doesn't appear to
offer anything new, but it's hard to tell.
Keywords: supercomputer, file system, parallel
I/O, pario-bib
Comment: A modification to the Unix file system to
allow for supercomputer access. Workload: file size from few KB to few GB,
I/O operation size from few bytes to hundreds of MB. Generally programs split
into I/O-bound and CPU-bound parts. Sequential and random access. Needs:
giant files (bigger than device), peak hardware performance for large files,
NFS access. Their FS is built into Unix ``transparently''. Space allocated in
clusters, rather than blocks; clusters might be as big as a cylinder. Allows
for efficient, large files. Mentions parallel disks as part of a ``virtual
volume'' but does not elaborate. Prefetching within a cluster.
Keywords: parallel I/O, collective I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of nitzberg:collective.
Abstract: Typical scientific applications require
vast amounts of processing power coupled with significant I/O capacity.
Highly parallel computer systems can provide processing power at low cost,
but tend to lack I/O capacity. By evaluating the performance and scalability
of the Intel iPSC/860 Concurrent File System (CFS), we can get an idea of the
current state of parallel I/O performance. I ran three types of tests on the
iPSC/860 system at the Numerical Aerodynamic Simulation facility (NAS):
broadcast, simulating initial data loading; partitioned, simulating reading
and writing a one-dimensional decomposition; and interleaved, simulating
reading and writing a two-dimensional decomposition. The CFS at NAS can
sustain up to 7 megabytes per second writing and 8 megabytes per second
reading. However, due to the limited disk cache size, partitioned read
performance sharply drops to less than 1 megabyte per second on 128 nodes. In
addition, interleaved read and write performance show a similar drop in
performance for small block sizes. Although the CFS can sustain 70-80% of
peak I/O throughput, the I/O performance does not scale with the number of
nodes. Obtaining maximum performance may require significant programming
effort: pre-allocating files, overlapping computation and I/O, using large
block sizes, and limiting I/O parallelism. A better approach would be to
attack the problem by either fixing the CFS (e.g., add more cache to the I/O
nodes), or hiding its idiosyncrasies (e.g., implement a parallel I/O
library).
Keywords: Intel, parallel file system, performance
measurement, parallel I/O, pario-bib
Comment: Straightforward measurements of an
iPSC/860 with 128 compute nodes, 10 I/O nodes, and 10 disks. This is a bigger
system than has been measured before. Has some basic MB/s measurements for
some features in Tables 1-2. CFS bug prevents more than 2 asynch requests at
a time. Another bug forced random-writes to use preallocated files. For low
number of procs, they weren't able to pull the full disk bandwidth. Cache
thrashing caused problems when they had a large number of procs, because each
read prefetched 8 blocks, which were flushed by some other proc doing a
subsequent read. Workaround by synchronizing procs to limit concurrency.
Increasing cache size is the right answer, but is not scalable.
Abstract: "Parallel I/O" is the support of a
single parallel application run on many nodes; application data is
distributed among the nodes, and is read or written to a single logical file,
itself spread across nodes and disks. Parallel I/O is a mapping problem from
the data layout in node memory to the file layout on disks. Since the mapping
can be quite complicated and involve significant data movement, optimizing
the mapping is critical for performance. We discuss our general model of
the problem, describe four Collective Buffering algorithms we designed, and
report experiments testing their performance on an Intel Paragon and an IBM
SP2, both housed at NASA Ames Research Center. Our experiments show
improvements of up to two orders of magnitude over standard techniques and the
potential to deliver peak performance with minimal hardware support.
Keywords: parallel I/O, collective I/O, pario-bib
Abstract: Typical scientific applications require
vast amounts of processing power coupled with significant I/O capacity.
Highly parallel computer systems provide floating point processing power at
low cost, but efficiently supporting a scientific workload also requires
commensurate I/O performance. In order to achieve high I/O performance, these
systems utilize parallelism in their I/O subsystems, supporting concurrent
access to files by multiple nodes of a parallel application, and striping
files across multiple disks. However, obtaining maximum I/O performance can
require significant programming effort. This tutorial presents a
snapshot of the state of I/O on highly parallel systems by comparing the
well-balanced I/O performance of a traditional vector supercomputer (the Cray
Y/MP C90) with the I/O performance of various highly parallel systems (Cray
T3D, IBM SP-2, Intel iPSC/860 and Paragon, and Thinking Machines CM-5). In
addition, the tutorial covers benchmarking techniques for evaluating current
parallel I/O systems and techniques for improving parallel I/O performance.
Finally, the tutorial presents several high level parallel I/O libraries and
shows how they can help application programmers improve I/O performance.
Keywords: parallel I/O, tutorial, pario-bib
Abstract: Typical scientific applications require
vast amounts of processing power coupled with significant I/O capacity.
Highly parallel computer systems provide floating-point processing power at
low cost, but efficiently supporting a scientific workload also requires
commensurate I/O performance. To achieve high I/O performance, these systems
use parallelism in their I/O subsystems, supporting concurrent access to
files by multiple nodes of a parallel application and striping files across
multiple disks. However, obtaining maximum I/O performance can require
significant programming effort. This tutorial will present a comprehensive
survey of the state of the art in parallel I/O from basic concepts to recent
advances in the research community. Requirements, interfaces, architectures,
and performance will be illustrated using concrete examples from commercial
offerings (Cray T3D, IBM SP-2, Intel Paragon, Meiko CS-2, and workstation
clusters) and academic research projects (MPI-IO, Panda, PASSION, PIOUS, and
Vesta). The material covered is roughly 30% beginner, 60% intermediate, and
10% advanced.
Keywords: parallel I/O, tutorial, pario-bib
Abstract: Parallel I/O, the process of
transferring a global data structure distributed among compute nodes to a
file striped across storage devices, can be quite complicated and involve a
significant amount of data movement. Optimizing parallel I/O with respect to
data distribution, file layout, and machine architecture is critical for
performance. In this work, we propose a solution to both the performance and
portability problems plaguing the wide acceptance of distributed memory
parallel computers for scientific computing: a collective parallel I/O
interface and efficient algorithms to implement it. A collective interface
allows the programmer to specify a file access as a high-level global
operation rather than as a series of seeks and writes. This not only provides
a more natural interface for the programmer, but also provides the system
with both the opportunity and the semantic information necessary to optimize
the file operation. We attack this problem in three steps: we evaluate
an early parallel I/O system, the Intel iPSC/860 Concurrent File System; we
design and analyze the performance of two classes of algorithms taking
advantage of collective parallel I/O; and we design MPI-IO, a collective
parallel I/O interface likely to become the standard for portable parallel
I/O. The collective I/O algorithms fall into two broad categories: data
block scheduling and collective buffering. Data block scheduling algorithms
attempt to schedule the individual data transfers to minimize resource
contention and to optimize for particular hardware characteristics. We
develop and evaluate three data block scheduling algorithms: Grouping,
Random, and Sliding Window. The data block scheduling algorithms improved
performance by as much as a factor of eight. The collective buffering
algorithms permute the data before writing or after reading in order to
combine small file accesses into large blocks. We design and test a series of
four collective buffering algorithms and demonstrate improvement in
performance by two orders of magnitude over naive file I/O for the hardest,
three-dimensional distributions.
Keywords: parallel I/O, parallel algorithm, file
system interface, pario-bib
Comment: See also nitzberg:cfs and
corbett:mpi-overview.
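As a minimal single-node sketch of the collective-buffering idea (only the data
rearrangement, not the parallel algorithms from this work): small, scattered
per-process pieces are permuted into file order and merged into a few large
contiguous writes.

    # Permute small (file_offset, data) pieces into file order and merge adjacent
    # pieces into large buffers before writing.
    def collective_buffer(pieces, buffer_size):
        pieces = sorted(pieces)                       # permute into file order
        writes, cur_off, cur = [], None, bytearray()
        for off, data in pieces:
            if cur_off is None:
                cur_off = off
            elif off != cur_off + len(cur) or len(cur) + len(data) > buffer_size:
                writes.append((cur_off, bytes(cur)))  # flush one large write
                cur_off, cur = off, bytearray()
            cur.extend(data)
        if cur:
            writes.append((cur_off, bytes(cur)))
        return writes

    # Two processes hold alternating 4-byte records of a distributed array.
    p0 = [(0, b"AAAA"), (8, b"CCCC")]
    p1 = [(4, b"BBBB"), (12, b"DDDD")]
    print(collective_buffer(p0 + p1, buffer_size=16))  # one 16-byte contiguous write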
Abstract: Many scientific applications have large
I/O requirements, in terms of both the size of data and the number of files
or data sets. Management, storage, efficient access, and analysis of this
data present an extremely challenging task. Traditionally, two different
solutions are used for this problem: file I/O or databases. File I/O can
provide high performance but is tedious to use with large numbers of files
and large and complex data sets. Databases can be convenient, flexible, and
powerful but do not perform and scale well for parallel supercomputing
applications. We have developed a software system, called Scientific Data
Manager (SDM), that aims to combine the good features of both file I/O and
databases. SDM provides a high-level API to the user and, internally, uses a
parallel file system to store real data and a database to store
application-related metadata. SDM takes advantage of various I/O
optimizations available in MPI-IO, such as collective I/O and noncontiguous
requests, in a manner that is transparent to the user. As a result, users can
write and retrieve data with the performance of parallel file I/O, without
having to bother with the details of actually performing file I/O. In
this paper, we describe the design and implementation of SDM. With the help
of two parallel application templates, ASTRO3D and an Euler solver, we
illustrate how some of the design criteria affect performance.
Keywords: scientific computing, database, parallel
I/O, pario-bib
Keywords: parallel I/O, pario-bib
Comment: see no:irregular2 and no:irregular.
Abstract: Due to the convergence of the fast
microprocessors with low latency and high bandwidth communication networks,
clusters of workstations are being used for high-performance computing. In
this paper we present the design and implementation of a runtime system to
support irregular applications on clusters of workstations, called
"Collective I/O Clustering". The system provides a friendly programming model
for performing I/O in irregular applications on clusters of workstations, and
is completely integrated with the underlying communication and I/O system.
All the performance results were obtained on the IBM-SP machine, located at
Argonne National Labs.
Keywords: parallel I/O, irregular applications,
pario-bib
Keywords: parallel I/O, pario-bib
Abstract: We present an elegant deterministic load
balancing strategy for distribution sort that is applicable to a wide variety
of parallel disks and parallel memory hierarchies with both single and
parallel processors. The simplest application of the strategy is an optimal
deterministic algorithm for external sorting with multiple disks and parallel
processors. In each input/output (I/O) operation, each of the $D \geq 1$
disks can simultaneously transfer a block of $B$ contiguous records. Our two
measures of performance are the number of I/Os and the amount of work done by
the CPU(s); our algorithm is simultaneously optimal for both measures. We
also show how to sort deterministically in parallel memory hierarchies. When
the processors are interconnected by any sort of a PRAM, our algorithms are
optimal for all parallel memory hierarchies; when the interconnection network
is a hypercube, our algorithms are either optimal or best-known.
Keywords: parallel I/O algorithm, sorting, shared
memory, pario-bib
Comment: Short version of nodine:sort2 and
nodine:sortdisk.
Abstract: We present an optimal deterministic
algorithm for external sorting on multiple disks. Our measure of performance
is the number of input/output (I/O) operations. In each I/O, each disk can
simultaneously transfer a block of data. Our algorithm improves upon a recent
randomized optimal algorithm and the (non-optimal) commonly used technique of
disk striping. The code is simple enough for easy implementation.
Keywords: parallel I/O algorithms, sorting,
pario-bib
Comment: Summary is nodine:sort. This is revision
of CS-91-04.
Abstract: We present several load balancing
paradigms pertinent to optimizing I/O performance with disk and processor
parallelism. We use sorting as our canonical application to illustrate the
paradigms, and we survey a wide variety of applications in computational
geometry. The use of parallel disks can help overcome the I/O bottleneck in
sorting if the records in each read or write are evenly balanced among the
disks. There are three known load balancing paradigms that lead to optimal
I/O algorithms: using randomness to assign blocks to disks, using the disks
predominantly independently, and deterministically balancing the blocks by
matching. In this report, we describe all of these techniques in detail and
compare their relative advantages. We show how randomized and deterministic
balancing can be extended to provide sorting algorithms that are optimal both
in terms of the number of I/Os and the internal processing time for
parallel-processing machines with scalable I/O subsystems and with parallel
memory hierarchies. We also survey results achieving optimal performance in
these models for a large range of online and batch problems in
computational geometry.
Keywords: parallel I/O algorithm, memory
hierarchy, load balance, sorting, pario-bib
Comment: Invited speaker: Jeffrey Vitter.
Keywords: parallel I/O algorithms, sorting,
pario-bib
Comment: They compare three techniques for
balancing I/O across parallel disks, using sorting as an example. The three
are randomization, using the disks independently, and tricky matching
techniques (as in balance sort). They also look at parallel memory
hierarchies. All in all, it seems to be mostly a survey of techniques in
earlier papers.
Keywords: external sorting, file access pattern,
parallel I/O, pario-bib
Comment: Describes algorithms for external sorting
that are optimal in the number of I/Os. Proposes a couple of fairly-realistic
memory hierarchy models. See also journal version vitter:uniform.
Keywords: parallel I/O algorithms, parallel
memory, sorting, pario-bib
Comment: see nodine:deterministic.
Keywords: parallel I/O algorithms, sorting,
pario-bib
Comment: see nodine:deterministic.
Abstract: Existing parallel programming
environments for networks of workstations improve the performance of
computationally intensive applications by using message passing or virtual
shared memory to alleviate CPU bottlenecks. This paper describes an approach
based on message passing that addresses both CPU and I/O bottlenecks for a
specific class of distributed applications on ATM networks. ATM provides the
bandwidth required to utilize multiple I/O channels in parallel. This paper
also describes an environment based on distributed process management and
centralized application management that implements the approach. The
environment adds processes to a running application when necessary to
alleviate CPU and I/O bottlenecks while managing process connections in a
manner that is transparent to the application.
Keywords: parallel I/O, ATM, parallel networking,
pario-bib
Abstract: Fast, accurate imaging of complex,
oil-bearing geologies, such as overthrusts and salt domes, is the key to
reducing the costs of domestic oil and gas exploration. Geophysicists say
that the known oil reserves in the Gulf of Mexico could be significantly
increased if accurate seismic imaging beneath salt domes were possible. A
range of techniques exist for imaging these regions, but the highly accurate
techniques involve the solution of the wave equation and are characterized by
large data sets and large computational demands. Massively parallel computers
can provide the computational power for these highly accurate imaging
techniques. A brief introduction to seismic processing will be presented, and
the implementation of a seismic-imaging code for distributed-memory computers
will be discussed. The portable code, Salvo, performs wave-equation-based,
3-D, prestack, depth imaging and currently runs on the Intel Paragon, the
Cray T3D, and the SGI Challenge series. It uses MPI for portability, and has
sustained 22 Mflops per processor (compiled Fortran) on the Intel Paragon.
Keywords: multiprocessor application, scientific
computing, seismic data processing, parallel I/O, pario-bib
Comment: 2 pages about their I/O scheme, mostly
regarding a calculation of the optimal balance between compute nodes and I/O
nodes to achieve perfect overlap.
Keywords: parallel application, scientific
computing, seismic data processing, parallel I/O, pario-bib, oldfield
Keywords: parallel application, scientific
computing, seismic data processing, parallel I/O, pario-bib, oldfield
Keywords: parallel architecture, shared memory,
supercomputer, parallel I/O, pario-bib
Comment: A MIMD, shared-memory machine, with
2-processor units embedded in a 3-d torus. Each link is bidirectional and
runs 300 MB/s. Processors are 150 MHz ALPHA, plus 16-64 MB RAM, plus a
memory interface unit. Global physical address space with remote-reference
and block-transfer capability. Not clear about cache coherency. Separate tree
network for global synchronization. Support for message send and optional
interrupt. I/O is all done through interface nodes that hook to the YMP host
and to its I/O clusters with 400 MB/s links. I/O is by default serialized,
but they do support a ``broadcast'' read operation (but see
pase:t3d-fortran). FORTRAN compiler supports the NUMA shared memory; PVM is
used for C and message passing.
Keywords: disk array, performance analysis,
pario-bib
Comment: Fairly complex analysis of a
multiprocessor attached to a disk array system through a central server that
is the buffer. Assumes task-oriented model for parallel system, where tasks
can be assigned to any CPU; this makes for an easy model. Like Reddy, they
compare declustering and striping (they call them striped and synchronized
disks).
Keywords: distributed file system, data storage,
mass storage, network-attached disks, Fibre Channel, pario-bib
Comment: position paper
Abstract: Scientific applications are increasingly
being implemented on massively parallel supercomputers. Many of these
applications have intense I/O demands, as well as massive computational
requirements. This paper is essentially an annotated bibliography of papers
and other sources of information about scientific applications using parallel
I/O. It will be updated periodically.
Keywords: parallel I/O application, file access
patterns, pario-bib, dfk
Abstract: High-performance distributed computing
appears to be shifting away from tightly-connected supercomputers to
computational grids composed of heterogeneous systems of networks, computers,
storage devices, and various other devices that collectively act as a single
geographically distributed virtual computer. One of the great challenges for
this environment is providing efficient parallel data access to remote
distributed datasets. In this paper, we discuss some of the issues associated
with parallel I/O and computational grids and describe the design of a
flexible parallel file system that allows the application to control the
behavior and functionality of virtually all aspects of the file system.
Keywords: parallel I/O, Grid, parallel file
system, pario-bib
Comment: Named one of two "best" papers in the
Grid category.
Abstract: Scientific applications are increasingly
being implemented on massively parallel supercomputers. Many of these
applications have intense I/O demands, as well as massive computational
requirements. This paper is essentially an annotated bibliography of papers
and other sources of information about scientific applications using parallel
I/O.
Keywords: parallel I/O application, file access
patterns, pario-bib, dfk
Comment: Part of jin:io-book.
Abstract: This short report describes our
experiences using the Emulab network testbed at the University of Utah to
test performance of the Armada framework for parallel I/O on computational
grids.
Keywords: emulab, network emulation, Armada,
performance, dfk, pario-bib
Abstract: High-performance computing increasingly
occurs on ``computational grids'' composed of heterogeneous and
geographically distributed systems of computers, networks, and storage
devices that collectively act as a single ``virtual'' computer. A key
challenge in this environment is to provide efficient access to data
distributed across remote data servers. Our parallel I/O framework, called
Armada, allows application and data-set providers to flexibly compose graphs
of processing modules that describe the distribution, application interfaces,
and processing required of the dataset before computation. Although the
framework provides a simple programming model for the application programmer
and the data-set provider, the resulting graph may contain bottlenecks that
prevent efficient data access. In this paper, we present an algorithm used to
restructure Armada graphs that distributes computation and data flow to
improve performance in the context of a wide-area computational grid.
Keywords: parallel I/O, Grid computing,
distributed computing, graph algorithms, pario-bib
Abstract: While high performance computers tend to
be measured by their processor and communications speeds, the bottleneck for
many large-scale applications is the I/O performance rather than the
computational or communication performance. One such application is the
processing of 3D seismic data. Seismic data sets, consisting of recorded
pressure waves, can be very large, sometimes more than a terabyte in size.
Even if the computations can be performed in-core, the time required to read
the initial seismic data and velocity model and write images is substantial.
This paper will discuss our approach in handling the massive I/O requirements
of seismic processing and show the performance of our imaging code (Salvo) on
the Intel Paragon.
Keywords: parallel I/O application, pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Abstract: High-performance computing increasingly
occurs on "computational grids" composed of heterogeneous and geographically
distributed systems of computers, networks, and storage devices that
collectively act as a single "virtual" computer. A key challenge in this
environment is to provide efficient access to data distributed across remote
data servers. This dissertation explores some of the issues associated with
I/O for wide-area distributed computing and describes an I/O system, called
Armada, with the following features: a framework to allow application and
dataset providers to flexibly compose graphs of processing modules that
describe the distribution, application interfaces, and processing required of
the dataset before or after computation; an algorithm to restructure
application graphs to increase parallelism and to improve network performance
in a wide-area network; and a hierarchical graph-partitioning scheme that
deploys components of the application graph in a way that is both beneficial
to the application and sensitive to the administrative policies of the
different administrative domains. Experiments show that applications using
Armada perform well in both low- and high-bandwidth environments, and that
our approach does an exceptional job of hiding the network latency inherent
in grid computing.
Keywords: parallel I/O, Grid computing, pario-bib
Abstract: High-performance computing increasingly
occurs on "computational grids" composed of heterogeneous and geographically
distributed systems of computers, networks, and storage devices that
collectively act as a single "virtual" computer. A key challenge in this
environment is to provide efficient access to data distributed across remote
data servers. This dissertation explores some of the issues associated with
I/O for wide-area distributed computing and describes an I/O system, called
Armada, with the following features: a framework to allow application and
dataset providers to flexibly compose graphs of processing modules that
describe the distribution, application interfaces, and processing required of
the dataset before or after computation; an algorithm to restructure
application graphs to increase parallelism and to improve network performance
in a wide-area network; and a hierarchical graph-partitioning scheme that
deploys components of the application graph in a way that is both beneficial
to the application and sensitive to the administrative policies of the
different administrative domains. Experiments show that applications using
Armada perform well in both low- and high-bandwidth environments, and that
our approach does an exceptional job of hiding the network latency inherent
in grid computing.
Keywords: parallel I/O, Grid computing, pario-bib
Keywords: I/O benchmark, transaction processing,
pario-bib
Comment: See wolman:iobench. Used IOBENCH to
compare normal disk configuration with striped disks, RAID level 1, and RAID
level 5, under a random I/O workload. Multiple disks with files on different
disks gave good performance (high throughput and low response time) with
multiple users. Striping ensures balanced load, similar performance. RAID
level 1 or level 5 ensures reliability at performance cost over striping, but
still good. Especially sensitive to write/read ratio - performance lost for
large number of writes.
Keywords: parallel I/O, multiprocessor file
system, parallel database, pario-bib
Comment: A custom multiprocessor with
shared-memory clusters networked together and to shared disks. Runs Mach.
Directory-based coherence protocol for the distributed file system.
Background writeback.
Keywords: disk array, multimedia, parallel I/O,
pario-bib
Keywords: compilers, parallel I/O, out-of-core
applications, pario-bib
Comment: They are developing extensions to the
Fortran D compiler so that it supports I/O-related directives for out-of-core
computations. The compiler then analyzes the computation, inserts the
necessary I/O calls, and optimizes the I/O. They hand-compile a red-black
relaxation program and an LU-factorization program. I/O was much faster than
VM, particularly because they were able to make large requests rather than
faulting on individual pages. Overlapping I/O and computation was also a big
win. See also kennedy:sio, bordawekar:model.
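A minimal sketch of the staging that makes the compiler-inserted I/O win over
virtual memory: read an entire out-of-core tile with one large request instead
of faulting it in page by page. The file name and tile geometry are made up
for illustration; the actual compiler generates calls to its own runtime
library.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define TILE_ROWS 1024
    #define TILE_COLS 1024

    int main(void)
    {
        int fd = open("matrix.dat", O_RDONLY);
        if (fd < 0) return 1;

        size_t tile_bytes = (size_t)TILE_ROWS * TILE_COLS * sizeof(double);
        double *tile = malloc(tile_bytes);
        off_t tile_offset = 0;                /* file offset of the current tile */

        /* One large, contiguous request per tile, instead of thousands of
           page-sized faults under demand paging. */
        pread(fd, tile, tile_bytes, tile_offset);

        /* ... relax / factor the in-core tile here, then write it back ... */

        free(tile);
        close(fd);
        return 0;
    }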
Keywords: object-based storage, distributed file
system, parallel file system, pario-bib
Comment: The paper describes the architecture of a
proprietary object-based storage system for clusters, an extension of Garth
Gibson's NASD work at CMU (see gibson:nasd-tr). Similar to Lustre
(cfs:lustre, braam:lustre-arch).
Keywords: RAID, disk array, parallel I/O,
pario-bib
Keywords: multimedia, video on demand, parallel
I/O, pario-bib
Comment: Part of a special issue on parallel and
distributed I/O.
Abstract: In this paper, we deal with the
data/parity placement problem which is described as follows: how to place
data and parity evenly across disks in order to tolerate two disk failures,
given the number of disks N and the redundancy rate p, which represents the
amount of disk space used to store parity information. To begin with, we
transform the data/parity placement problem into the problem of constructing
an N x N matrix such that the matrix will correspond to a solution to the
problem. The method to construct a matrix has been proposed and we have shown
how our method works through several illustrative examples. It is also shown
that any matrix constructed by our proposed method can be mapped into a
solution to the placement problem if a certain condition holds between N and
p where N is the number of disks and p is a redundancy rate.
Keywords: parallel I/O, disk array, reliability,
fault tolerance, pario-bib
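The paper's $N \times N$ matrix construction for tolerating two failures is
not reproduced here. For orientation only, the sketch below shows the far
simpler single-failure, RAID-5-style rotated-parity layout, for contrast:
parity for stripe $s$ is rotated across the $N$ disks so that no single disk
becomes a parity bottleneck.

    /* Rotated single-parity placement (NOT the paper's two-failure scheme):
       stripe s keeps its parity block on disk (N - 1 - s mod N). */
    int parity_disk(int stripe, int ndisks)
    {
        return (ndisks - 1) - (stripe % ndisks);
    }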
Abstract: Scientific applications often require
some strategy for temporary data storage to do the largest possible
simulations. The use of virtual memory for temporary data storage has
received criticism because of performance problems. However, modern virtual
memory found in recent operating systems such as Cenju-3/DE gives application
writers control over virtual memory policies. We demonstrate that custom
virtual memory policies can dramatically reduce virtual memory overhead and
allow applications to run out-of-core efficiently. We also demonstrate that
the main advantage of virtual memory, namely programming simplicity, is not
lost.
Keywords: virtual memory, file interface,
scientific applications, out-of-core, parallel I/O, pario-bib
Comment: Web and CDROM only. They advocate the use
of traditional demand-paged virtual memory systems in supporting out-of-core
applications. They are implementing an operating system for the NEC
Cenju-3/DE, a shared-nothing MIMD multiprocessor with a multistage
interconnection network and disks on every node. The operating system is
based on Mach, and they have extended Mach to allow user-provided [local]
replacement policies. Basically, they argue that you can get good performance
as long as you write your own replacement policy (even OPT is possible in
certain applications), and that this is easier than (re)writing the
application with explicit out-of-core file I/O calls. They measure the
performance of two applications on their system, with OPT, FIFO, and a new
replacement algorithm customized to one of the applications. They show that
they can get much better performance with some replacement policies than with
others, but despite the paper's title they do not compare with the
performance of an equivalent program using file I/O.
Keywords: parallel I/O, reliability, RAID,
pario-bib
Comment: They use ECC with one or more parity
drives in bit-interleaved systems, and on-line regeneration of failed drives
from spares. More cost-effective than mirrored disks. One of the earliest
references to RAID-like concepts. Basically, they describe RAID3.
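A minimal sketch of the parity arithmetic behind the scheme described above:
parity is the XOR of the corresponding data on all data drives, and a failed
drive is regenerated by XOR-ing the parity with the surviving drives. The
function names and the byte-wise framing are illustrative only.

    #include <stddef.h>

    /* parity[i] = XOR of data[0..ndrives-1][i] */
    void compute_parity(const unsigned char **data, int ndrives,
                        size_t len, unsigned char *parity)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char p = 0;
            for (int d = 0; d < ndrives; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* Rebuild drive `failed` from the parity and the surviving data drives. */
    void regenerate(const unsigned char **data, int ndrives, int failed,
                    const unsigned char *parity, size_t len, unsigned char *out)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char p = parity[i];
            for (int d = 0; d < ndrives; d++)
                if (d != failed)
                    p ^= data[d][i];
            out[i] = p;
        }
    }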
Keywords: parallel I/O, pario-bib
Abstract: This paper presents a novel, top-down,
high-level approach to parallelizing file I/O. Each parallel file descriptor
is annotated with a high-level specification, or template, of the expected
parallel behaviour. The annotations are external to and independent of the
source code. At run-time, all I/O using a parallel file descriptor adheres to
the semantics of the selected template. By separating the parallel I/O
specifications from the code, a user can quickly change the I/O behaviour
without rewriting code. Templates can be composed hierarchically to construct
complex access patterns. Two sample parallel programs using these templates
are compared against versions implemented in an existing parallel I/O system
(PIOUS). The sample programs show that the use of parallel I/O templates is
beneficial from both the performance and software engineering points of view.
Keywords: parallel programming, parallel I/O,
pario-bib
Comment: An interesting approach in which they try
to separate the description of the parallelism in a file's access from the
sequential programming used to access the file. Seems like a good idea. It
seems to assume that the programmer was porting an existing sequential code,
or prefers to write their parallel program with a sequential frame of mind,
including the existing fopen/fread/fwrite stdio interface. They retain the
traditional stream-of-bytes file structure. See also parsons:complex.
Abstract: This report describes the MPP Fortran
programming model which will be supported on the first phase MPP systems.
Based on existing and proposed standards, it is a work sharing model which
combines features from existing models in a way that may be both efficiently
implemented and useful.
Keywords: compiler, parallel language,
supercomputing, parallel I/O, pario-bib
Comment: See also oed:t3d for T3D overview. I only
read the part about I/O. The only I/O support, apparently, is for each
processor to open and access the file independently from all other
processors.
Abstract: Understanding the characteristic I/O
behavior of scientific applications is an integral part of the research and
development efforts for the improvement of high performance I/O systems. This
study focuses on application level I/O behavior with respect to both static
and dynamic characteristics. We observed the San Diego Supercomputer Center's
Cray C90 workload and isolated the most I/O intensive applications. The
combination of a low-level description of physical resource usage and the
high-level functional composition of applications and scientific disciplines
for this set reveals the major sources of I/O demand in the workload. We
selected two applications from the I/O intensive set and performed a detailed
analysis of their dynamic I/O behavior. These applications exhibited a high
degree of regularity in their I/O activity over time and their characteristic
I/O behaviors can be precisely described by one and two, respectively,
recurring sequences of data accesses and computation periods.
Keywords: parallel I/O, pario-bib
Keywords: scientific computing, file access
patterns, I/O, pario-bib
Comment: This paper extends some of their previous
results, but the real bottom line here is that some scientific applications
do a lot of I/O, the I/O is bursty, and the pattern of bursts is cyclic and
regular. Seems like this cyclic nature could be a source of some
optimization. Included in the parallel I/O bibliography because it is useful
to that community, though they did not trace parallel workload.
Keywords: scientific computing, file access
patterns, pario-bib
Comment: Analyzed one month of accounting records
from Cray YMP8/864 in SDSC's production environment. Their base assumption is
that scientific application I/O is regular and predictable, e.g., repetitive
periodic bursts, with distinct phases, repeating patterns, and sequential
access. The goal is to characterize a set of I/O-intensive scientific
applications and evaluate regularity of resource usage. They measure volumes
and rates of applications and total system. Cumulative and average usage for
each distinct non-system application. Most resource usage came from the 5%
of applications that were not system applications. ``Virtual I/O rate'' is
the bytes transferred per CPU second, which is IMHO only a rough measure
because sometimes I/O overlaps CPU time, and sometimes does not. They picked
out long-running applications with a high virtual I/O rate. Top 50
applications had 71% of bytes transferred and 10% of CPU time. Of those,
4.66 MB/sec min, 131 MB/sec max. Of those they picked the ones executed most
often. Cluster analysis showed only 1-2 clusters. Correlation between I/O and
CPU time. Included in the parallel I/O bibliography because it is useful to
that community, though they did not trace parallel workload.
Keywords: design, parallel file system, parallel
I/O, pario-bib
Comment: Describes the requirements and desired
performance features of a parallel file system designed for the DOE ASCI
computers.
Keywords: I/O, file system, parallel I/O,
pario-bib
Comment: This is the intro to a special issue on
I/O.
Keywords: caching, prefetching, file system,
hints, I/O, resource management, parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of patterson:informed.
Keywords: parallel I/O, RAID, reliability, cost
analysis, I/O bottleneck, disk array, pario-bib
Comment: Part of jin:io-book; reformatted version
of patterson:raid.
Abstract: In this paper, we present aggressive,
proactive mechanisms that tailor file system resource management to the needs
of I/O-intensive applications. In particular, we show how to use
application-disclosed access patterns (hints) to expose and exploit I/O
parallelism, and to dynamically allocate file buffers among three competing
demands: prefetching hinted blocks, caching hinted blocks for reuse, and
caching recently used data for unhinted accesses. Our approach estimates the
impact of alternative buffer allocations on application execution time and
applies cost-benefit analysis to allocate buffers where they will have the
greatest impact. We have implemented informed prefetching and caching in
Digital's OSF/1 operating system and measured its performance on a 150 MHz
Alpha equipped with 15 disks running a range of applications. Informed
prefetching reduces the execution time of text search, scientific
visualization, relational database queries, speech recognition, and object
linking by 20-83%. Informed caching reduces the execution time of
computational physics by up to 42% and contributes to the performance
improvement of the object linker and the database. Moreover, applied to
multiprogrammed, I/O-intensive workloads, informed prefetching and caching
increase overall throughput.
Keywords: caching, prefetching, file system,
hints, I/O, resource management, parallel I/O, pario-bib
Comment: See patterson:informed-tr for an earlier
version. Programs may give hints to the file system about what they will read
in the future, and in what order. Hints are used for informed prefetching and
informed caching. Most interesting thing about this paper over the earlier
ones is the buffer management. Prefetcher and demand fetcher both want
buffers. LRU cache and hinted cache both could supply buffers (thru
replacement). Each supplies a cost for giving up buffers and benefit for
getting more buffers. These are expressed in a common 'currency', in terms of
their expected effect on I/O service time, and a manager takes buffers from
one and gives buffers to another when the benefits outweigh the costs. All is
based on a simple model, which is further simplified in their implementation
within OSF/1. Performance looks good, they can keep more disks busy in a
parallel file system. Furthermore, informed caching helps reduce the number
of I/Os. Indeed they 'discover' MRU replacement policy automatically.
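A minimal sketch of the buffer trading described in the comment above. The
names and the estimator callbacks are hypothetical; in the real system the
costs and benefits are derived from a model of expected I/O service time, and
the manager reassigns buffers only while a trade is profitable.

    /* Each pool (demand/LRU cache, hinted cache, prefetcher) reports, in a
       common currency of expected I/O time, the marginal benefit of gaining
       one buffer and the marginal cost of giving one up.  Termination assumes
       diminishing returns: benefits fall and costs rise as sizes change. */
    typedef struct {
        const char *name;
        double (*benefit_of_gain)(int nbuffers); /* I/O time saved per buffer gained */
        double (*cost_of_loss)(int nbuffers);    /* I/O time added per buffer lost   */
        int nbuffers;
    } pool_t;

    void rebalance(pool_t *pools, int npools)
    {
        for (;;) {
            int to = -1, from = -1;
            double best_gain = 0.0, least_cost = 1e300;
            for (int i = 0; i < npools; i++) {
                double b = pools[i].benefit_of_gain(pools[i].nbuffers);
                if (to < 0 || b > best_gain) { best_gain = b; to = i; }
                double c = pools[i].cost_of_loss(pools[i].nbuffers);
                if (pools[i].nbuffers > 0 && c < least_cost) { least_cost = c; from = i; }
            }
            if (from < 0 || to == from || best_gain <= least_cost)
                break;                  /* no profitable trade remains */
            pools[from].nbuffers--;     /* e.g. the LRU cache shrinks...   */
            pools[to].nbuffers++;       /* ...and the prefetch depth grows */
        }
    }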
Abstract: The underutilization of disk parallelism
and file cache buffers by traditional file systems induces I/O stall time
that degrades the performance of modern microprocessor-based systems. In this
paper, we present aggressive mechanisms that tailor file system resource
management to the needs of I/O-intensive applications. In particular, we show
how to use application-disclosed access patterns (hints) to expose and
exploit I/O parallelism, and to dynamically allocate file buffers among three
competing demands: prefetching hinted blocks, caching hinted blocks for
reuse, and caching recently used data for unhinted accesses. Our approach
estimates the impact of alternative buffer allocations on application
execution time and applies a cost-benefit analysis to allocate buffers where
they will have the greatest impact. We implemented informed prefetching and
caching in DEC's OSF/1 operating system and measured its performance on a 150
MHz Alpha equipped with 15 disks. When running a range of applications
including text search, 3D scientific visualization, relational database
queries, speech recognition, and computational chemistry, informed
prefetching reduces the execution time of four of these applications by 20 to 87%. Informed caching reduces the execution time of the fifth application
by up to 30%.
Keywords: caching, prefetching, file system,
hints, I/O, resource management, parallel I/O, pario-bib
Keywords: parallel I/O, file prefetching, file
caching, pario-bib
Comment: This 'paper' is really an annotated set
of slides.
Abstract: Informed prefetching provides a simple
mechanism for I/O-intensive, cache-ineffective applications to efficiently
exploit highly-parallel I/O subsystems such as disk arrays. This mechanism,
dynamic disclosure of future accesses, yields substantial benefits over
sequential readahead mechanisms found in current file systems for
non-sequential workloads. This paper reports the performance of the Transparent
Informed Prefetching system (TIP), a minimal prototype implemented in a Mach
3.0 system with up to four disks. We measured reductions by factors of up to
1.9 and 3.7 in the execution time of two example applications: multi-file
text search and scientific data visualization.
Keywords: prefetching, parallel I/O, pario-bib
Comment: Also available in HTML format at
http://www.cs.cmu.edu/Web/Groups/PDL/HTML-Papers/PDIS94/final.fm.html.
Keywords: parallel I/O, RAID, reliability, cost
analysis, I/O bottleneck, disk array, pario-bib
Comment: Make a good case for the upcoming I/O
crisis, compare single large expensive disks (SLED) with small cheap disks.
Outline five levels of RAID that give different reliabilities, costs, and
performances. Block-interleaved with a single check disk (level 4) or with
check blocks interspersed (level 5) seem to give best performance for
supercomputer I/O or database I/O or both. Note: the TR by the same name
(UCB/CSD 87/391) is essentially identical.
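The performance cost relative to plain striping that these comments mention
comes mostly from the parity update on small writes: with block-interleaved
parity (levels 4 and 5), updating one data block $D$ requires reading the old
data and old parity, then writing both back with
$P_{new} = P_{old} \oplus D_{old} \oplus D_{new}$,
i.e. four disk accesses per small write, versus one for a non-redundant
striped array. This is the usual accounting; level 5 spreads those parity
accesses over all disks instead of serializing them on one parity disk.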
Keywords: parallel I/O, RAID, reliability, cost
analysis, I/O bottleneck, disk array, pario-bib
Comment: A short version of patterson:raid, with
some slight updates.
Abstract: Computerized data has become critical to
the survival of an enterprise. Companies must have a strategy for recovering
their data should a disaster such as a fire destroy the primary data center.
Current mechanisms offer data managers a stark choice: rely on affordable
tape but risk the loss of a full day of data and face many hours or even days
to recover, or have the benefits of a fully synchronized on-line remote
mirror, but pay steep costs in both write latency and network bandwidth to
maintain the mirror. In this paper, we argue that asynchronous mirroring, in
which batches of updates are periodically sent to the remote mirror, can let
data managers find a balance between these extremes. First, by eliminating
the write latency issue, asynchrony greatly reduces the performance cost of a
remote mirror. Second, by storing up batches of writes, asynchronous
mirroring can avoid sending deleted or overwritten data and thereby reduce
network bandwidth requirements. Data managers can tune the update frequency
to trade network bandwidth against the potential loss of more data. We
present SnapMirror, an asynchronous mirroring technology that leverages file
system snapshots to ensure the consistency of the remote mirror and optimize
data transfer. We use traces of production filers to show that even updating
an asynchronous mirror every 15 minutes can reduce data transferred by
30% to 80%. We find that exploiting file system knowledge of
deletions is critical to achieving any reduction for no-overwrite file
systems such as WAFL and LFS. Experiments on a running system show that using
file system metadata can reduce the time to identify changed blocks from
minutes to seconds compared to purely logical approaches. Finally, we show
that using SnapMirror to update every 30 minutes increases the response time
of a heavily loaded system by only 22%. With increasing frequency, companies are
instituting disaster recovery plans to ensure appropriate data availability
in the event of a catastrophic failure or disaster that destroys a site (e.g.
flood, fire, or earthquake). It is relatively easy to provide redundant
server and storage hardware to protect against the loss of physical
resources. Without the data, however, the redundant hardware is of little
use.
Keywords: file systems, pario-bib
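A schematic of the block selection the abstract describes, not WAFL's actual
metadata walk: using per-snapshot allocation maps plus a changed-block test,
only blocks that are still allocated in the new snapshot and differ from the
base snapshot are transferred, so deleted and overwritten data never cross the
network. All names and the bitmap encoding are hypothetical.

    #include <stdbool.h>
    #include <stddef.h>

    static bool allocated_in(const unsigned char *bitmap, size_t blk)
    {
        return (bitmap[blk / 8] >> (blk % 8)) & 1;
    }

    /* Returns the number of blocks to send; block numbers go into send_list. */
    size_t select_blocks_to_send(const unsigned char *alloc_base,
                                 const unsigned char *alloc_new,
                                 bool (*block_differs)(size_t blk),
                                 size_t nblocks, size_t *send_list)
    {
        size_t nsend = 0;
        for (size_t blk = 0; blk < nblocks; blk++) {
            if (!allocated_in(alloc_new, blk))
                continue;                  /* deleted: nothing to mirror */
            if (allocated_in(alloc_base, blk) && !block_differs(blk))
                continue;                  /* unchanged since the last update */
            send_list[nsend++] = blk;      /* new or modified block */
        }
        return nsend;
    }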
Abstract: This paper focuses on extending the
power of caching and prefetching to reduce file read latencies by exploiting
application level hints about future I/O accesses. We argue that systems that
disclose high-level knowledge can transfer optimization information across
module boundaries in a manner consistent with sound software engineering
principles. Such Transparent Informed Prefetching (TIP) systems provide a
technique for converting the high throughput of new technologies such as
disk arrays and log-structured file systems into low latency for
applications. Our preliminary experiments show that even without a
high-throughput I/O subsystem, TIP yields reduced execution time of up to 30%
for applications obtaining data from a remote file server and up to 13% for
applications obtaining data from a single local disk. These experiments
indicate that greater performance benefits will be available when TIP is
integrated with low level resource management policies and highly parallel
I/O subsystems such as disk arrays.
Keywords: file system, prefetching, operating
system, pario-bib
Comment: Not much new over previous TIP papers,
but does have newer numbers. See patterson:tip1. Also appears in DAGS'93
(patterson:tip2). Previously appeared as TR CMU-CS-93-1.
Abstract: This paper focuses on extending the
power of caching and prefetching to reduce file read latencies by exploiting
application level hints about future I/O accesses. We argue that systems that
disclose high-level knowledge can transfer optimization information across
module boundaries in a manner consistent with sound software engineering
principles. Such Transparent Informed Prefetching (TIP) systems provide a
technique for converting the high throughput of new technologies such as disk
arrays and log-structured file systems into low latency for applications. Our
preliminary experiments show that even without a high-throughput I/O
sub-system TIP yields reduced execution time of up to 30% for applications
obtaining data from a remote file server and up to 13% for applications
obtaining data from a single local disk. These experiments indicate that
greater performance benefits will be available when TIP is integrated with
low level resource management policies and highly parallel I/O subsystems
such as disk arrays.
Keywords: file system, prefetching, operating
system, pario-bib
Comment: Invited speaker: Garth Gibson. Similar
paper appeared in ACM OSR April 1993 (patterson:tip)
Abstract: RISC pioneer and UC Berkeley Computer
Science Professor David Patterson is working to develop input/output systems
to match the increasingly higher performance of new processors. Here he
describes the results of the RAID (Redundant Arrays of Inexpensive Disks)
project, which offers much greater performance, capacity, and reliability
from I/O systems. Patterson also discusses a new project, Sequoia 2000, which
looks at utilizing small helical-scan tapes, such as digital audiotapes or
videotapes, to offer terabytes of storage for the price of a file server. He
believes that a 1000x increase in storage, available on most Ethernets, will
have a much greater impact than a 1000x increase in processing speed.
Keywords: videotape, computer architecture,
parallel I/O, pario-bib
Comment: See patterson:trends. 58 minutes.
Keywords: sorting, parallel I/O algorithm,
pario-bib
Comment: Main contribution appears to be a new
sampling method for initial partition of data set. They approach it from a
database point of view.
Abstract: This paper discusses our implementation
of Rajasekaran's (l,m)-mergesort algorithm (LMM) for sorting on parallel
disks. LMM is asymptotically optimal for large problems and has the
additional advantage of a low constant in its I/O complexity. Our
implementation is written in C using the ViC* I/O API for parallel disk
systems. We compare the performance of LMM to that of the C library function
qsort on a DEC Alpha server. qsort makes a good benchmark because it is fast
and performs comparatively well under demand paging. Since qsort fails when
the swap disk fills up, we can only compare these algorithms on a limited
range of inputs. Still, on most out-of-core problems, our implementation of
LMM runs between 1.5 and 1.9 times faster than qsort, with the gap widening
with increasing problem size.
Keywords: parallel I/O, out of core, sorting,
parallel algorithm, pario-bib
Comment: Undergraduate Honors Thesis. Advisor: Tom
Cormen.
Abstract: Distributed filesystems are a typical
solution in networked environments such as clusters and grids. Parallel
filesystems are a typical solution for reaching a high-performance distributed
I/O environment, but those filesystems have some limitations in
heterogeneous storage systems. Usually in distributed systems, load balancing
is used as a solution to improve the performance, but typically the
distribution is made between peer-to-peer computational resources and from
the processor point of view. In heterogeneous systems, like heterogeneous
clusters of workstations, the existing solutions do not work so well.
However, such systems are more widely used every day, with grid environments
as an extreme example. In this paper we bring attention
to those aspects of heterogeneous distributed data systems, presenting a
parallel file system that takes into account the heterogeneity of storage nodes,
the dynamic addition of new storage nodes, and an algorithm to group requests
in heterogeneous systems.
Keywords: parallel I/O, load balancing, pario-bib
Abstract: Association rules are very useful and
interesting patterns in many data mining scenarios. The Apriori algorithm is
the best-known association rule algorithm. This algorithm interacts with a
storage system in order to access input data and output the results. This
paper shows how to optimize this algorithm adapting the underlying storage
system to this problem through the usage of hints and parallel features.
Keywords: parallel I/O, pario-bib
Abstract: This document describes the detailed
design of the CLFS, one of the components of the Cache Coherent File System
(CCFS). CCFS has three main components: Client File Server (CLFS), Local File
Server (LFS), Concurrent Disk System (CDS). The Client File Servers are
located on each processing node, to provide file manager functions on a
per-node basis. The CLFS will interact with the LFSs to provide block services,
naming, locking, real input/output and to manage the disk system, partitions,
distributed partitions, etc. The CLFS includes a standard POSIX interface
(internally parallelized) and some parallel extensions. It will be responsible
for maintaining cache consistency, distributing accesses to servers, providing
a file system interface to the user, etc.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Keywords: multi-agent parallel file system,
pario-bib
Abstract: We present an overview of ParFiSys, a
coherent parallel file system developed at the UPM to provide I/O services to
the GPMIMD machine, an MPP built within the ESPRIT project P-5404. Special
emphasis is placed on the results obtained during the ParFiSys evaluation. They
were obtained using several I/O benchmarks (PARKBENCH, IOBENCH, etc.) and
several MPP platforms (T800, T9000, etc.) to cover a broad spectrum of the
ParFiSys features, and are specifically oriented to measuring throughput for
the I/O patterns of scientific applications. ParFiSys is especially well suited to
provide I/O services to scientific applications requiring high I/O bandwidth,
to minimize application porting effort, and to exploit the parallelism of
generic message-passing multicomputers.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Abstract: The philosophy behind grids is to use
idle resources to achieve a higher level of computational services
(computation, storage, etc.). Existing data grid solutions are based on new
servers, specific APIs, and protocols; however, this approach is not a
realistic solution for enterprises and universities, because it requires
the deployment of new data servers across the company. This paper describes a
new approach to data access in computational grids. This approach is called
GridExpand, a parallel I/O middleware that integrates heterogeneous data
storage resources in grids. The proposed grid solution integrates available
data network solutions (NFS, CIFS, WebDAV) and makes it possible to access a
global grid file system. Our solution differs from others because it does not
need the installation of new data servers with new protocols. Most of the
data grid solutions use replication as the way to obtain high performance.
Replication, however, introduces consistency problems for many collaborative
applications, and sometimes requires a great deal of resources. To
obtain high performance, we apply the parallel I/O techniques used in
parallel file systems.
Keywords: Data Grids, Parallel I/O, data
declustering, High performance I/O, pario-bib
Comment: A short paper describing an adaptation of
the Expand parallel file system for data grids. Also see the related paper
garcia:expand-design.
Abstract: Existing parallel file systems provide
applications little control for optimizing I/O accesses. Most of these
systems use optimization techniques transparent to the applications, limiting
the performance achieved by these solutions. Furthermore, there is a big gap
between the interface provided by parallel file systems and the needs of
applications. In fact, most of the parallel file systems do not use intuitive
I/O hints or other optimization approaches. As a result, application
programmers cannot take advantage of optimization techniques suitable for the
application domain. This paper describes I/O optimizations techniques used in
MAPFS, a multiagent I/O architecture. These techniques are configured by
means of a double interface for specifying access patterns or hints that
increase the performance of I/O operations. An example of this interface is
shown.
Keywords: parallel I/O, optimizations, caching,
prefetching, hints, pario-bib
Keywords: parallel I/O architecture, pario-bib
Comment: Part of jin:io-book.
Keywords: parallel programming, parallel
architecture, parallel I/O, pario-bib
Comment: A language- and application-driven
proposal for parallel architecture, that mixes SIMD and MIMD,
high-performance networking, large memory, shared address space, and so
forth. Fairly convincing arguments. One disk per node. Little mention of a
file system though. Email from student Udo Boehm:``We use in the version of
Triton/1 with 256 PE's 72 Disks at the moment (the filesystem is scalable up
to 256 Disks). These Disks are divided into 8 Groups with 9 Disks. In each
group exists one parity disk. Our implementation of the filesystem is an
parallel version of RAID Level 3 with some extensions. We use so called
vector files for diskaccess. A file is always distributed over all disks of
the diskarray. A vectorfile is divided in logical blocks. A logical block
exist of 72 physical blocks, each block is on one of the 72 disks and all
these 72 physical blocks have the same blocknumber on each disk. A logical
block has 18432 Bytes, where 16384 Bytes are for Data. The filesystem uses
these logical blocks to save data. We do not use special PE's for the I/O.
All PE's can be (are) used to do I/O ! There exists no central which
coordinates the PE's.''
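The numbers in the quoted email are self-consistent: a logical block of 18432
bytes spread over all 72 disks implies physical blocks of $18432 / 72 = 256$
bytes, and with one parity disk in each of the 8 groups of 9 disks, a logical
block carries $72 - 8 = 64$ data blocks, i.e. $64 \times 256 = 16384$ bytes of
data, matching the quote.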
Keywords: parallel I/O, hypercube, Intel iPSC/2,
multiprocessor file system, pario-bib
Comment: Intel iPSC/2 Concurrent File System.
Chose to tailor system for high performance for large files, read in large
chunks. Uniform logical file system view, Unix stdio interface. Blocks
scattered over all disks, but not striped. Blocksize 4K optimizes
message-passing performance without using blocks that are too big.
Tree-directory is stored in ONE file and managed by ONE process, so opens are
bottlenecked, but that is not their emphasis. File headers, however, are
scattered. The file header info contains a list of blocks. File header is
managed by disk process on its I/O node. Data caching is done only at the I/O
node of the originating disk drive. Read-ahead is used but not detailed here.
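A hypothetical sketch of the per-file metadata the comment describes, not the
actual CFS on-disk format: the header records the file's block list, each
block can live on any I/O node, and the header itself resides on the I/O node
of the originating drive.

    #define CFS_BLOCK_SIZE 4096
    #define CFS_MAX_BLOCKS 1024               /* made-up limit for the sketch */

    struct cfs_block_ref {
        unsigned short io_node;               /* which I/O node holds the block   */
        unsigned int   block_no;              /* block number on that node's disk */
    };

    struct cfs_file_header {
        unsigned long        length;          /* file length in bytes */
        unsigned int         nblocks;         /* data blocks allocated */
        struct cfs_block_ref blocks[CFS_MAX_BLOCKS];
    };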
Abstract: The article describes the design and
implementation of parallel I/O in the object-oriented message-passing library
TPO++. TPO++ is implemented on top of the message passing standard MPI and
provides an object-oriented, type-safe and data centric interface to
message-passing. Starting with version 2, the MPI standard defines primitives
for parallel I/O called MPI-IO. Based on this layer, we have implemented an
object-oriented parallel I/O interface in TPO++. The project is part of our
efforts to apply object-oriented methods to the development of parallel
physical simulations. We give a short introduction to our message-passing
library and detail its extension to parallel I/O. Performance measurements
between TPO++ and MPI are compared and discussed.
Keywords: object-oriented message passing, TPO++,
parallel I/O interface, pario-bib
Abstract: One of the crucial problems in image
processing is Image Matching, i.e., to match two images, or in our case, to
match a model with the given image. Since this problem is highly computation
intensive, parallel processing is essential to obtain the solutions in time
under real-world constraints. The Hausdorff method is used to locate human
beings in images by matching the image with models and is parallelized with
MPI. The images are usually stored in files with different formats. As most
of the formats can be converted into ASCII file format containing integers,
we have implemented 3 strategies namely, Normal File Reading, Off-line
Conversion and Run-time Conversion for free format integer file reading and
writing. The parallelization strategy is optimized so that I/O overheads are
minimal. The relative performances with multiple processors are tabulated for
all the cases and discussed. The results obtained demonstrate the efficiency
of our strategies, and the implementations will enhance file
interoperability, which will be useful for the image processing community in
using parallel systems to meet real-time constraints.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Keywords: parallel I/O, pario-bib, multiprocessor
file system, file access pattern, checkpoint
Comment: Goal is to collect a set of
representative applications from biology, chemistry, earth science,
engineering, graphics, and physics, use performance-monitoring tools to
analyze them, create templates and benchmarks that represent them, and then
later to evaluate the performance of new I/O tools created by rest of the SIO
initiative. Seem to be four categories of I/O needs: input, output,
checkpoint, and virtual memory (``out-of-core'' scratch space). Not all types
are significant in all applications. (Two groups mention databases and the
need to perform computationally complex queries.) Large input is typically
raw data (seismic soundings, astronomical observations, satellite remote
sensing, weather information). Sometimes there are real-time constraints.
Output is often periodic, e.g., the state of the system every few timesteps;
typically the volume would increase along with I/O capacity and bandwidth.
Checkpointing is a common request; preferably allowing the application to choose
what and when to checkpoint, and definitely including the state of files.
Many kinds of out-of-core: 1) temp files between passes (often written and
read sequentially), 2) regular patterns like FFT, matrix transpose, solvers,
and single-pass read/compute/write, 3) random access, e.g., to precomputed
tables of integrals. Distinct differences in the ways people choose to divide
data into files; sometimes all in one huge file, sometimes many ``small''
files (e.g., one per processor, one per timestep, one per region, etc.).
Important: overlap of computation and I/O, independent access by individual
processors. Not always important: ordering of records read or written by
different processors, exposing the I/O model to the application writer. Units
of I/O seem to be either (sub)matrices (1-5 dimensions) or items in a
collection of objects (100-10000 bytes each). Data sets varied up to 1 TB;
bandwidth needs varied up to 1 GB/s. See also bagrodia:sio-character,
choudhary:sio-language, bershad:sio-os.
Abstract: In large-scale discrete-event
simulations the size of a computer's physical memory limits the size of the
system to be simulated. Demand paging policies that support virtual memory
are generally ineffective. Use of parallel processors to execute the
simulation compounds the problems, as memory can be tied down due to
synchronization needs. We show that by taking more direct control of disks it
is possible to break through the memory bottleneck, without significantly
increasing overall execution time. We model one approach to conducting
out-of-core parallel simulation, identifying relationships between execution,
memory, and I/O costs that admit good performance.
Keywords: discrete-event simulation, parallel
computing, out-of-core application, parallel I/O, pario-bib
Keywords: file system, unix, parallel I/O, disk
striping, pario-bib
Comment: A new file system for Unix based on
striped files. Better performance for sequential access, better for
large-file random access and about the same for small-file random access.
Allows full striping track prefetch, or even volume prefetch. Needs a little
bit of buffer management change. Talks about buffer management and parity
blocks.
Abstract: With rapid advances in computer and
communication technologies, there is an increasing demand to build and
maintain large image repositories. In order to reduce the demands on I/O and
network resources, multiresolution representations are being proposed for the
storage organization of images. Image decomposition techniques such as
wavelets can be used to provide these multiresolution images. The original
image is represented by several coefficients, one of them with visual
similarity to the original image, but at a lower resolution. These visually
similar coefficients can be thought of as thumbnails or icons of
the original image. This paper addresses the problem of storing these
multiresolution coefficients on disks so that thumbnail browsing as well as
image reconstruction can be performed efficiently. Several strategies are
evaluated to store the image coefficients on parallel disks. These strategies
can be classified into two broad classes depending on whether the access
pattern of the images is used in the placement. Disk simulation is used to
evaluate the performance of these strategies. Simulation results are
validated with results from experiments with real disks and are found to be
in good agreement. The results indicate that significant performance
improvements can be achieved with as few as four disks by placing image
coefficients based upon browsing access patterns.
Keywords: multimedia, parallel I/O, pario-bib
Comment: They use simulation to study several
different placement policies for the thumbnail and varying-resolution
versions of images on a disk array.
Keywords: parallel I/O, Intel iPSC/2, nCUBE,
pario-bib
Comment: Simple comparison of the iPSC/2 and
nCUBE/10 parallel I/O systems. Short description of each system, with simple
transfer rate measurements. See also french:ipsc2io-tr.
Keywords: Linux, shared file system,
network-attached disks, disk striping, parallel I/O, pario-bib
Comment: They discuss a shared, serverless, file
system for Linux that integrates IP-based network attached storage and
Fibre-Channel-based storage area networks. Based on soltis:gfs.
Keywords: parallel I/O, message-passing,
multiprocessor file system interface, pario-bib
Comment: See newer version mpi-ioc:mpi-io5.
Keywords: RAID, disk array, parallel I/O,
pario-bib
Comment: Basically, an educational piece about the
basics of RAID technology. Helps to define terms across the industry. Written
by the RAID advisory board, which is an industry consortium. Overviews RAID,
RAID levels, non-Berkeley RAID levels. List of Board members. Bibliography.
Keywords: parallel I/O benchmarks, MPI-IO,
pario-bib
Keywords: MPI-IO, MPI, parallel I/O, pario-bib
Abstract: Several models of parallel disks are
found in the literature. These models have been proposed to alleviate the I/O
bottleneck arising in handling voluminous data. These models have the general
theme of assuming multiple disks. For instance the parallel disks model
assumes D disks and a single computer. It is also assumed that a block of
data from each of the D disks can be fetched into the main memory in one
parallel I/O operation. In this paper, we study a model where there are more
than one processor and each processor has an associated disk. In addition to
the I/O cost, one also has to account for the inter-processor communication
costs. To begin with, we study the mesh and investigate the performance of
the mesh with respect to out-of-core computing. As a case study we consider
the problem of sorting. The goal of this paper is to study the properties of
this model.
Keywords: out-of-core, sorting, parallel disk
model, performance analysis, pario-bib
Abstract: With the widening gap between processor
speeds and disk access speeds, the I/O bottleneck has become critical.
Parallel disk systems have been introduced to alleviate this bottleneck. In
this paper we present deterministic and randomized selection algorithms for
parallel disk systems. The algorithms to be presented, in addition to being
asymptotically optimal, have small underlying constants in their time bounds
and hence have the potential of being practical.
Keywords: I/O algorithms, parallel I/O, pario-bib
Abstract: As Linux clusters emerged as an
alternative to traditional supercomputers, one of the problems faced was the
absence of a high-performance parallel file system comparable to the file
systems on commercial machines. The Parallel Virtual File System (PVFS),
developed at Clemson University, has attempted to address this issue. PVFS is
a parallel file system currently used in parallel I/O research and as a
parallel file system on Linux clusters running high-performance parallel
applications. An important component of parallel file systems is the file
system interface, which has different requirements compared to the normal
UNIX interface, particularly for I/O. A parallel I/O interface is required to
provide support for non-contiguous access patterns, collective I/O, and large
file sizes in order to achieve good performance with parallel applications.
As it supports significantly different functionality, the interface exposed
by a parallel file system is important: the file system needs to either
directly provide a parallel I/O interface, or at least support such an
interface being implemented on top of it. The PVFS2 System Interface is the
native file system interface for PVFS2, the next generation of PVFS. The
System Interface provides support for multiple interfaces, such as a POSIX
interface or a parallel I/O interface like MPI-IO, to access PVFS2, while
also allowing the benefits of abstraction by decoupling the System Interface
from the actual file system implementation. This document discusses the
design and implementation of the System Interface for PVFS2.
Keywords: pvfs, parallel file system, system
interface, pario-bib
Abstract: Multicasting large amounts of data
efficiently to all nodes of a PC cluster is an important operation. In the
form of a partition cast it can be used to replicate entire software
installations by cloning. Optimizing a partition cast for a given cluster of
PCs reveals some interesting architectural tradeoffs, since the fastest
solution depends not only on the network speed and topology, but remains
highly sensitive to other resources like the disk speed, the memory system
performance and the processing power in the participating nodes. We present
an analytical model that guides an implementation towards an optimal
configuration for any given PC cluster. The model is validated by
measurements on our cluster using Gigabit- and Fast Ethernet links. The
resulting simple software tool, Dolly, can replicate an entire 2 GByte
Windows NT image onto 24 machines in less than 5 minutes.
Keywords: multicast, network, cluster, parallel
I/O, pario-bib
Multi-use clusters of commodity PCs have by far enough storage
on their hard-disk drives for the required local operating-system (OS)
installation and therfore there is a lot of excess storage in a multi-use
cluster. This additional disk space on the nodes should be put to a better
use for a variety of interesting applications e.g.\ for on-line analytic data
processing (OLAP). The specific contributions of the thesis include solutions
to four important problems of optimized resource usage in multi-use-cluster
environments. Analytic models of computer systems are important to
understand the performance of current systems and to predict the performance
of future systems early in the design stage. The thesis instroduces a simple
analytic model of data streams in clusters. The model considers the
topology of data streams as well as the limitations of the edges and nodes.
It also takes into account the limitations of the resources within the nodes,
which are passed through by the data streams. Using the model, the
thesis evaluates different data-casting techniques that can be used to
replicate OS installations to many nodes in clusters. The different
implementations based on IP multicast, star-, tree- and multi-drop-chain
topologies are evaluated with the analytic model as well as with experimental
measurements. As a result of the evaluation, the multi-drop chain is
proposed as most suitable replication technique. When working with
multi-use clusters, we noticed that maintenance of the highly replicated
system software is difficult, because there are many OS installations in
different versions and customisations. Since it is desirable to backup all
older versions and customisations of all OS installations, I implemented
several techniques to archive the large amounts of highly redundant data
contained in the nodes' OS partitions. The techniques take different
approaches of comparing the data, but are all OS independent and work with
whole partition images. The block repositories that store only
unique data blocks prove to be an efficient data storage for OS
installations in multi-use clusters. Finally we look at the
possibilities to take advantage of the excess storage on the many nodes'
hard-disk drives. The thesis investigates several ways to gather data from
multiple server nodes to a client node running the applications. The combined
storage can be used for data-warehousing applications. While powerful
multi-CPU ``killer workstations'' with redundant arrays of inexpensive disks
(RAIDs) are the current workhorses for data warehousing because of their
compatibility with standard databases, they are still expensive compared to
multi-use clusters of commodity PCs. On the other hand, several researchers
in databases have tried to find domain-specific solutions using middleware. My
thesis looks at the question whether, and to what extent, the cost-efficient
multi-use clusters of commodity PCs can provide an alternative
data-warehousing platform with an OS solution that is transparent enough to
run a commodity database system. To answer the question about the most
suitable software layer for a possible implementation, the thesis
compares different distributed file systems and distributed-device systems
against the middleware solution that uses database-internal communication for
distributing partial queries. The different approaches are modelled with the
analytic model and evaluated with a microbenchmark as well as the TPC-D
decision-support benchmark. Given the existing systems and software
packages, it looks like the domain-specific middleware approach delivers the
best performance; among the transparent OS-only solutions, distributed
devices are faster than the more complex distributed file systems
and achieve similar performance to a system with local disks only.
Abstract: Over the last few decades, the power of
personal computers (PCs) has grown steadily, following the exponential growth
rate predicted by Moore's law. The trend towards the commoditization of PC
components (such as CPUs, memories, high-speed interconnects and disks)
results in a highly attractive price/performance ratio of the systems built
from those components. Following these trends, I propose to integrate the
commodity IT resources of an entire company or organization into
multi-use clusters of commodity PCs. These include compute farms,
experimental clusters as well as desktop PCs in offices and labs. This thesis
follows a bottom-up architectural approach and deals with hardware and
system-software architecture with a tight focus on performance and
efficiency. In contrast, the Grid view of providing services instead of
hardware for storage and computation deals mostly with problems of
capability, service and security rather than performance and modelling
thereof.
Keywords: Cluster of PCs, commodity computing,
data streams, multicast, cloning, data storage, distributed file systems,
distributed devices, network-attached disks, OS image distribution, pario-bib
Comment: See also rauch:partitioncast
Keywords: parallel I/O, pario-bib, compilers
Comment: This version is only 2 pages.
reddy:compiler-tr provides the full text. They discuss three primary issues.
1) Overlapping I/O with computation: the compiler's dependency analysis is
used to decide when some I/O may be moved up and performed asynchronously
with other computation. 2) Parallel execution of I/O statements: if all
sizes are known at compile time, the compiler can insert seeks so that
processes can access the file independently. When writing in the presence of
conditionals they even propose skipping by the maximum and leaving holes in
the file, and they claim that this doesn't hurt (!). 3) Parallel format
conversion: again, if there are fixed-width fields the compiler can have
processors seek to different locations, read data independently, and do
format conversion in parallel. Really all this is saying is that fixed-width
fields are good for parallelism, and that compilers could take advantage of
them.
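As a rough illustration of the second and third points (this is not code from the paper), the sketch below shows how fixed-width records let each process compute its own byte offset and read its share of a shared file independently; the record width, rank, process count, and file name are made-up values.

#include <stdio.h>
#include <stdlib.h>

#define REC_BYTES 64   /* assumed fixed record width */

int main(void) {
    int nprocs = 4, my_rank = 2;        /* would come from the runtime */
    long nrecs = 1000;                  /* total records in the file */
    long per_proc = (nrecs + nprocs - 1) / nprocs;
    long first = my_rank * per_proc;    /* this process's first record */

    FILE *f = fopen("data.bin", "rb");
    if (!f) return 1;
    /* With fixed-width fields the byte offset is a simple product; no
       process needs to scan earlier records or coordinate seeks. */
    fseek(f, first * (long)REC_BYTES, SEEK_SET);
    char *buf = malloc((size_t)per_proc * REC_BYTES);
    size_t got = fread(buf, REC_BYTES, (size_t)per_proc, f);
    printf("rank %d read %zu records starting at record %ld\n",
           my_rank, got, first);
    free(buf);
    fclose(f);
    return 0;
}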
Keywords: parallel I/O, pario-bib, compilers
Keywords: parallel I/O, hypercube, pario-bib
Comment: Emphasis is on adjacency. It also implies
(and they assume) that data is distributed well across the disks so no data
needs to move beyond the neighbors of an I/O node. Still, the idea of
adjacency is good since it allows for good data distribution while not
requiring it, and for balancing I/O procs among procs in a good way. Also
avoids messing up the hypercube regularity with (embedded) dedicated I/O
nodes.
Keywords: parallel I/O, hypercube, pario-bib
Comment: See reddy:hyperio3 for extended version.
Keywords: parallel I/O, hypercube, pario-bib
Comment: An overall paper restating their
embedding technique from reddy:hyperio1, plus a little bit of evaluation
along the lines of reddy:pario2, plus some ideas about matrix layout on the
disks. They claim that declustering is important, since synchronized disks do
not provide enough parallelism, especially in the communication across the
hypercube (since the synchronized disks must hang off one node).
Keywords: parallel I/O, disk array, disk striping,
pario-bib
Comment: see also expanded version reddy:pario2
Keywords: parallel I/O, disk array, disk striping,
pario-bib
Comment: Compares declustered disks (sort of
MIMD-like) to synchronized-interleaved (SIMD-like). Declustering is needed for
scalability, and is better for scientific workloads. Handles large
parallelism needed for scientific workloads and for RAID-like architectures.
Synchronized interleaving is better for general file system workloads due to
better utilization and reduction of seek overhead.
Keywords: parallel I/O, disk array, disk striping,
pario-bib
Comment: nothing new over expanded version
reddy:pario2, little different from reddy:pario
Keywords: parallel I/O, file access pattern,
workload, multiprocessor file system, benchmark, pario-bib
Comment: Using five applications from the Perfect
benchmark suite, they studied both implicit (paging) and explicit (file) I/O
activity. They found that the paging activity was relatively small and that
sequential access to VM was common. All access to files was sequential,
though this may be due to the programmer's belief that the file system is
sequential. Buffered I/O would help to make transfers bigger and more
efficient, but there wasn't enough rereferencing to make caching useful.
Keywords: parallel I/O, multiprocessor
architecture, pario-bib
Comment: Much of the material in this thesis has
been published in other papers, i.e., reddy:io, reddy:notsame,
reddy:hyperio1, reddy:hyperio2, reddy:hyperio3, reddy:pario, reddy:pario2,
reddy:pario3, reddy:perfectio, reddy:mmio. He traces some ``Perfect''
benchmarks to determine paging and file access patterns. He simulates a
variety of declustered, synchronized, and synchronized-declustered striping
configurations under both ``file'' and ``scientific'' workloads to determine
which is best. He proposes embeddings for I/O nodes in hypercubes, where the
I/O nodes are just like regular nodes but with an additional I/O processor
and disk(s). He studies the disk configurations again, when embedded in
hypercubes. He proposes ways to lay out matrices (in blocked form) across
disks in a hypercube. He proposes a new parity-based fault-tolerance scheme
that prevents overloading during failure-mode access. And he considers
compiler issues: overlapping I/O with computation, parallelizing I/O
statements, and parallel format conversion.
Keywords: parallel I/O, pario-bib, dfk
Comment: This paper summarizes the presentations
made by panel members at the ICPP panel discussion on parallel I/O, and the
ensuing discussion.
Keywords: I/O characterization, checkpointing,
collective I/O, parallel database, I/O optimization, pario-bib
Keywords: MIMD, parallel architecture, shared
memory, parallel I/O, pario-bib
Comment: This describes the Monarch computer from
BBN. It was never built. 65K processors and memory modules. 65GB RAM.
Bfly-style switch in dance-hall layout. Switch is synchronous; one switch
time is a frame (one microsecond, equal to 3 processor cycles) and all
processors may reference memory in one frame time. Local I-cache only.
Contention reduces full bandwidth by 16 percent. Full 64-bit machine. Custom
VLSI. Each memory location has 8 tag bits. One allows for a location to be
locked by a processor. Thus, any FetchAndOp or full/empty model can be
supported. I/O is done by adding I/O processors (up to 2K in a 65K-proc
machine) in the switch. They plan 200 disks, each with an I/O processor, for
65K nodes. They would spread each block over 9 disks, including one for
parity (essentially RAID).
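The parity scheme mentioned here (eight data units plus one parity unit per block) is ordinary XOR parity; the small sketch below, with made-up unit sizes and contents, shows how the parity unit is computed and how a lost unit is rebuilt from the survivors.

#include <stdio.h>
#include <string.h>

#define UNITS 8
#define UNIT_BYTES 16

int main(void) {
    unsigned char data[UNITS][UNIT_BYTES], parity[UNIT_BYTES] = {0};
    for (int i = 0; i < UNITS; i++)
        memset(data[i], 'a' + i, UNIT_BYTES);

    /* The parity unit is the XOR of the eight data units. */
    for (int i = 0; i < UNITS; i++)
        for (int b = 0; b < UNIT_BYTES; b++)
            parity[b] ^= data[i][b];

    /* Reconstruct a "failed" unit (unit 3) from the others plus parity. */
    unsigned char rebuilt[UNIT_BYTES];
    memcpy(rebuilt, parity, UNIT_BYTES);
    for (int i = 0; i < UNITS; i++)
        if (i != 3)
            for (int b = 0; b < UNIT_BYTES; b++)
                rebuilt[b] ^= data[i][b];

    printf("rebuilt unit matches original: %s\n",
           memcmp(rebuilt, data[3], UNIT_BYTES) == 0 ? "yes" : "no");
    return 0;
}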
Abstract: The increasing performance and
decreasing cost of processors and memory are causing system intelligence to
move into peripherals from the CPU. Storage system designers are using this
trend toward "excess" compute power to perform more complex processing and
optimizations inside storage devices. To date, such optimizations have been
at relatively low levels of the storage protocol. At the same time, trends in
storage density, mechanics, and electronics are eliminating the bottleneck in
moving data off the media and putting pressure on interconnects and host
processors to move data more efficiently. We propose a system called Active
Disks that takes advantage of processing power on individual disk drives to
run application-level code. Moving portions of an application's processing to
execute directly at disk drives can dramatically reduce data traffic and take
advantage of the storage parallelism already present in large systems today.
We discuss several types of applications that would benefit from this
capability with a focus on the areas of database, data mining, and
multimedia. We develop an analytical model of the speedups possible for
scan-intensive applications in an Active Disk system. We also experiment with
a prototype Active Disk system using relatively low-powered processors in
comparison to a database server system with a single, fast processor. Our
experiments validate the intuition in our model and demonstrate speedups of
2x on 10 disks across four scan-based applications. The model promises linear
speedups in disk arrays of hundreds of disks, provided the application data
is large enough. (57 refs.)
Keywords: active disks, active storage,
application level code, database server, data mining, pario-bib
This
dissertation presents the factors that will make Active Disks a reality in
the not-so-distant future, the characteristics of applications that will
benefit from this technology, an analysis of the improved performance and
efficiency of systems built around Active Disks, and a discussion of some of
the optimizations that are possible with more knowledge available directly at
the devices. It also compares this work with previous work on database
machines and examines the opportunities that allow us to take advantage of
these promises today where previous approaches have not succeeded. The
analysis is motivated by a set of applications from data mining, multimedia,
and databases and is performed in the context of a prototype Active Disk
system that shows dramatic speedups over a system with traditional, "dumb"
disks. Abstract: Today's commodity disk drives, the basic
unit of storage for computer systems large and small, are actually small
computers, with a processor, memory, and 'network' connection, along with the
spinning magnetic material that permanently stores the data. As more and more
of the information in the world becomes digitally available, and more and
more of our daily activities are recorded and stored, people are increasingly
finding value in analyzing, rather than simply storing and forgetting, these
large masses of data. Sadly, advances in I/O performance have lagged the
development of commodity processor and memory technology, putting pressure on
systems to deliver data fast enough for these types of data-intensive
analysis. This dissertation proposes a system called Active Disks that takes
advantage of the processing power on individual disk drives to run
application-level code. Moving portions of an application's processing
directly to the disk drives can dramatically reduce data traffic and take
advantage of the parallelism already present in large storage systems. It
provides a new point of leverage to overcome the I/O bottleneck.
Keywords: storage, active disks, embedded systems,
architecture, databases, data mining, disk scheduling, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: They describe their experience building a
file system for SUNMOS. Paper describes tuning the SCSI device, their
striping strategy, their message-passing tricks, and some performance
results.
Keywords: file prefetching, distributed file
system, parallel I/O, pario-bib
Comment: Part of a special issue on parallel and
distributed I/O.
Keywords: weather simulation, scientific
application, parallel I/O, pario-bib
Comment: Related to hart:grid.
Abstract: This document describes the detailed
design of the CDS, one of the components of the Cache Coherent File System
(CCFS). CCFS has three main components: the Client File Server (CLFS), the
Local File Server (LFS), and the Concurrent Disk System (CDS). A CDS is
located on each disk node to provide input/output functions on a per-node
basis. The CDS will interact with the microkernel drivers to execute real
input/output and to manage the disk system. The CDS includes general services
for distributing accesses to disks, managing partition information, etc.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Keywords: CPU scheduling, disk scheduling, I/O
model, parallel I/O, pario-bib
Keywords: parallel architecture, shared memory,
MIMD, interconnection network, parallel I/O, memory-mapped files, pario-bib
Comment: Overview of the KSR1.
Keywords: multiprocessor file system, Unix, Mach,
memory mapped file, pario-bib
Comment: Describes the modifications to the OSF/1
AD file system for a multicomputer environment. Goal is for normal Unix
files, not supercomputer access. The big thing was separation of the caching
from backing store management, by pulling out the cache management into the
Extended Memory Management (XMM) subsystem. Normally OSF/1 maps files to Mach
memory objects, which are then accessed (through read() and write()) using
bcopy(). XMM makes it possible to access these memory objects from any node
in the system, providing coherent compute-node caching of pages from the
memory object. It uses tokens controlled by the XMM server at the file's
server node to support a single-reader, single-writer policy on the whole
file, but migrating page by page. They plan to extend to multiple writers,
but atomicity constraints on the file pointer and metadata make it difficult.
Files are NOT striped across file servers or I/O nodes. Several hacks were
necessary to work around Mach interface problems. Unix buffer caching is
abandoned. Future includes supercomputer support in the form of turning off
all caching. No performance evaluation included. See zajcew:osf1.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: They believe that the API should be
Unix-compatible, systems must support scalable performance on large transfers
of data, and that systems must support very large files. Most of the paper is
specifics about the Paragon PFS interface, which has many features not
mentioned in earlier PFS papers. Contact brad@ssd.intel.com or
payne@ssd.intel.com.
Keywords: parallel I/O workload, file access
pattern, Intel, pario-bib
Comment: A sample code that tries to behave like a
parallel ARC3D in terms of its output. It writes two files, one containing
three three-dimensional matrices X, Y, and Z, and the other containing the
four-dimensional matrix Q. The matrices are spread over all the nodes, and
each file is written in parallel by the processors. See also ryan:navier.
Keywords: parallel application, CFD, parallel I/O,
pario-bib
Comment: This paper goes with the ryan:cfs code
example. Describes their parallel implementation of the ARC3D code on the
iPSC/860. A section of the paper considers I/O, which is to write out a large
multidimensional matrix at each timestep. They found that it was actually
faster to write to separate files, because congestion in the I/O nodes was
hurting performance. Even so, they never got more than 2 MB/s on a system
that should obtain 7-10 MB/s peak.
Keywords: parallel I/O, disk striping, disk array,
pario-bib
Comment: See the techreport salem:striping for a
nearly identical but more detailed version.
Keywords: parallel I/O, disk striping, disk array,
pario-bib
Comment: Cite salem:diskstripe instead. Basic
paper on striping, for a uniprocessor, single-user machine. Interleaving is
asynchronous, even without matching disk locations, though matching locations
is discussed. All done with models.
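For readers new to the idea, a minimal sketch of round-robin striping (not the paper's exact model) maps logical block b to disk b mod D at local block offset b div D; the array size D below is an assumed value.

#include <stdio.h>

int main(void) {
    int D = 4;                      /* number of disks (assumed) */
    for (long b = 0; b < 10; b++) {
        int  disk   = (int)(b % D); /* which disk holds block b */
        long offset = b / D;        /* block index within that disk */
        printf("logical block %ld -> disk %d, block %ld\n", b, disk, offset);
    }
    return 0;
}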
Keywords: hypercube, multiprocessor file system
interface, pario-bib
Comment: Previously, hypercubes were programmed as
a combination of host and node programs. Salmon proposes to use a universal
host program that acts essentially as a file server, responding to requests
from the node programs. Two modes: crystalline, where node programs run in
loose synchrony, and amorphous, where node programs are asynchronous. In the
crystalline case, files have a single file pointer and are either single- or
multiple- access; single access means all nodes must simultaneously issue the
same request; multiple access means they all simultaneously issue the same
request with different parameters, giving an interleaved pattern. Amorphous
allows asynchronous activity, with separate file pointers per node.
Abstract: Hierarchical treecodes have, to a large
extent, converted the compute-bound N-body problem into a memory-bound
problem. The large ratio of DRAM to disk pricing suggests use of out-of-core
techniques to overcome memory capacity limitations. We will describe a
parallel, out-of-core treecode library, targeted at machines with independent
secondary storage associated with each processor. Borrowing the space-filling
curve techniques from our in-core library, and ``manually'' paging, results
in excellent spatial and temporal locality and very good performance.
Keywords: parallel I/O, out of core applications,
scientific computing, pario-bib
Comment: Only published on CD-ROM
Abstract: Random redundant allocation of data to
parallel disk arrays can be exploited to achieve low access delays. New
algorithms are proposed which improve the previously known shortest queue
algorithm by systematically exploiting that scheduling decisions can be
deferred until a block access is actually started on a disk. These algorithms
are also generalized for coding schemes with low redundancy. Using extensive
experiments, practically important quantities are measured which have so far
eluded an analytical treatment: the delay distribution when a stream of
requests approaches the limit of the system capacity, the system efficiency
for parallel disk applications with bounded prefetching buffers, and the
combination of both for mixed traffic. A further step towards practice is
taken by outlining the system design for alpha: an automatically
load-balanced parallel hard-disk array. (31 refs.)
Keywords: parallel disks, lazy scheduling, random
redundant storage, I/O algorithm, random block placement, bipartite matching,
pario-bib
Comment: Also see later version sanders:jasync.
Abstract: In this paper we document our experience
implementing MPI-IO file access using MPI datatypes. We present performance
results and discuss two significant problems that stem from the flexibility
of MPI datatypes. First, MPI datatypes can be used to specify non-contiguous
access patterns. Optimizing data transfers for such patterns is difficult.
Second, the behavior of MPI datatypes in a heterogeneous environment is not
well-defined.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: They devise several file-access
strategies for different situations, depending on the particulars of the
etypes and filetypes in use: sequential, two-phase I/O, one file access per
etype (random access), and one file access per etype element (random access
with smaller pieces). They measure the performance of their system with
example patterns that trigger each strategy. It would be nice to see a more
extensive performance analysis of their implementation, and of their
strategies.
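The kind of request their strategies must handle can be illustrated with standard MPI-IO calls; the sketch below (not the authors' code) builds a strided filetype with MPI_Type_vector, sets a per-rank file view, and writes collectively with MPI_File_write_all. The file name, block sizes, and counts are arbitrary.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process owns every nprocs-th block of 4 doubles. */
    MPI_Datatype filetype;
    MPI_Type_vector(8, 4, 4 * nprocs, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double buf[32];
    for (int i = 0; i < 32; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* The displacement shifts each rank to its own starting block,
       so the ranks' strided views interleave without overlapping. */
    MPI_Offset disp = (MPI_Offset)rank * 4 * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, 32, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}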
Abstract: Allocation of data to a parallel disk
using redundant storage and random placement of blocks can be exploited to
achieve low access delays. New algorithms are proposed which improve the
previously known shortest queue algorithm by systematically exploiting the
fact that scheduling decisions can be deferred until a block access is
actually started on a disk. These algorithms are also generalized for coding
schemes with low redundancy. Using extensive simulations, practically
important quantities are measured which have so far eluded an analytical
treatment: The delay distribution when a stream of requests approaches the
limit of the system capacity, the system efficiency for parallel disk
applications with bounded prefetching buffers, and the combination of both
for mixed traffic. A further step toward practice is taken by outlining the
system design for alpha: an automatically load-balanced parallel hard-disk
array. Additional algorithmic measures are proposed for alpha that allow
variable-sized blocks, seek-time reduction, fault tolerance, inhomogeneous
systems, and flexible prioritization schemes. (41 refs.)
Keywords: parallel disks, lazy scheduling, random
redundant storage, I/O algorithm, random block placement, bipartite matching,
pario-bib
Abstract: For the design and analysis of
algorithms that process huge data sets, a machine model is needed that
handles parallel disks. There seems to be a dilemma between simple and
flexible use of such a model and accurate modeling of details of the
hardware. This paper explains how many aspects of this problem can be
resolved. The programming model implements one large logical disk allowing
concurrent access to arbitrary sets of variable size blocks. This model can
be implemented efficiently on multiple independent disks even if zones with
different speed, communication bottlenecks and failed disks are allowed.
These results not only provide useful algorithmic tools but also imply a
theoretical justification for studying external memory algorithms using
simple abstract models. The algorithmic approach is random redundant
placement of data and optimal scheduling of accesses. The analysis
generalizes a previous analysis for simple abstract external memory models in
several ways (higher efficiency, variable block sizes, more detailed disk
model).
Keywords: parallel I/O, pario-bib
Keywords: RAID, disk array, parallel I/O,
pario-bib
Comment: RAID array that relaxes the consistency
requirements, to not write parity during busy periods, then to go back and
update parity during idle periods. Thus you sacrifice a little reliability
for performance; you can select how much.
Keywords: parallel I/O, disk array, disk striping,
load balance, pario-bib
Comment: Updated as scheuermann:partition2. They
describe a file system that attempts to choose both the degree of
declustering and the striping unit size to accommodate the needs of different
files. They also describe static and dynamic placement and migration policies
to readjust the load across disks. Note that there are several references in
the bib that are about their file system, called FIVE. Seems to be the same
as scheuermann:tunable.
Abstract: Parallel disk systems provide
opportunities for exploiting I/O parallelism in two possible ways, namely via
inter-request and intra-request parallelism. In this paper, we discuss the
main issues in performance tuning of such systems, namely striping and load
balancing, and show their relationship to response time and throughput. We
outline the main components of an intelligent, self-reliant file system that
aims to optimize striping by taking into account the requirements of the
applications, and performs load balancing by judicious file allocation and
dynamic redistributions of the data when access patterns change. Our system
uses simple but effective heuristics that incur only little overhead. We
present performance experiments based on synthetic workloads and real-life
traces.
Keywords: parallel I/O, disk array, disk striping,
load balance, pario-bib
Comment: Updated version of scheuermann:partition.
Keywords: parallel I/O, disk array, disk striping,
pario-bib
Comment: Seems to be the same as
scheuermann:partition.
Keywords: parallel file system, cluster computing,
parallel I/O, pario-bib
Keywords: parallel I/O, pario-bib
Comment: A brief overview of issues in parallel
I/O, and a short case study of the data-intensive computational grid at CERN.
Keywords: parallel I/O, pario-bib
Comment: In the context of client-server database
systems, they propose to make a compromise between shared-disk architectures,
where the disks are all attached to the network and all machines are both
clients and servers, and a system where the disks are attached to a single
server. Their compromise attaches the disks to both the network and the
server.
Abstract: The HCSA (Hybrid Client-Server
Architecture), a flexible system layout that combines the advantages of the
traditional Client-Server Architecture (CSA) with those of the Shared Disk
Architecture (SDA), is introduced. In HCSA, the traditional CSA-style
I/O subsystem is modified to give the clients network access to both the
server and the server's set of disks. Hence, the HCSA is more
fault-tolerant than the CSA since there are two paths between any client and
the shared data. Moreover, a simulation study demonstrates that the
HCSA is able to support a larger number of clients than the CSA or SDA under
similar system workloads. Finally, the HCSA can run applications in
either a CSA mode, an SDA mode, or a combination of the two, thus offering
backward compatibility with a large number of existing applications.
Keywords: parallel I/O architecture, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel I/O, IBM SP-2, pario-bib
Keywords: I/O, data distribution, medical imaging
application, parallel I/O, pario-bib
Comment: The paper describes DIOM (Distributed I/O
management), a system to manage data distributed to local disks of a cluster
of workstations. The distribution process uses semantic information from both
the data set and the application to decide how to distribute the data. The
data is stored using a self-describing format (similar to HDF). The
description of the data is either stored in a file header, or it is part of a
central repository (format identified by file suffix). DIOM decides how to
distribute the data based on the application-supplied splitting pattern, of
which there are three types: single (copy all data to a single node), block
(divide the data evenly between the nodes), and round (stripe blocks in a
round-robin fashion). Parameters such as stripe size, initial node, etc., are
defined by the app.
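A hypothetical sketch of the block and round splitting patterns (not DIOM's actual interface) might look like the following; the chunk and node counts are made up.

#include <stdio.h>

int main(void) {
    int nchunks = 10, nnodes = 3;   /* assumed sizes for illustration */
    for (int c = 0; c < nchunks; c++) {
        int per_node = (nchunks + nnodes - 1) / nnodes;
        int block_node = c / per_node;   /* "block": contiguous ranges */
        int round_node = c % nnodes;     /* "round": round-robin striping */
        printf("chunk %d: block->node %d, round->node %d\n",
               c, block_node, round_node);
    }
    return 0;
}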
Keywords: parallel I/O, RAID, disk array, disk
architecture, pario-bib
Comment: Very practical description of the RAID I
prototype.
Keywords: parallel I/O, reliability, RAID, disk
array, disk architecture, pario-bib
Comment: Published version of second paper in
chen:raid. Some overlap with schulze:raid, though that paper has more detail.
Keywords: parallel disks, disk array, parity,
RAID, pario-bib
Abstract: Parity-declustered data layouts were
developed to reduce the time for on-line failure recovery in disk arrays.
They generally require perfect balancing of reconstruction workload among the
disks; this restrictive balance condition makes such data layouts difficult
to construct. In this paper, we consider approximately balanced data layouts,
where some variation in the reconstruction workload over the disks is
permitted. Such layouts are considerably easier to construct than perfectly
balanced layouts. We consider three methods for constructing approximately
balanced data layouts, and analyze their performance both theoretically and
experimentally. We conclude that on uniform workloads, approximately balanced
layouts have performance nearly identical to that of perfectly balanced
layouts.
Keywords: disk array, parity, RAID, parallel I/O,
pario-bib
Abstract: Parity declustering has been used to
reduce the time required to reconstruct a failed disk in a disk array. Most
existing work on parity declustering uses BIBD-based data layouts, which
distribute the workload of reconstructing a failed disk over the remaining
disks of the array with perfect balance. For certain array sizes, however,
there is no known BIBD-based layout. In this paper, we evaluate data layouts
that are approximately balanced - that is, that distribute the
reconstruction workload over the disks of the array with only approximate
balance. Approximately balanced layouts are considerably easier to construct
than perfectly balanced layouts. We consider three methods for generating
approximately balanced layouts: randomization, simulated annealing, and
perturbing a BIBD-based layout whose size is near the desired size. We
compare the performance of these approximately balanced layouts with that of
perfectly balanced layouts using a disk array simulator. We conclude that, on
uniform workloads, approximately balanced data layouts have performance
nearly identical to that of perfectly balanced layouts. Approximately
balanced layouts therefore provide the reconstruction performance benefits of
perfectly balanced layouts for arrays where perfectly balanced layouts are
either not known, or do not exist.
Keywords: parallel I/O, disk array, parity, RAID,
pario-bib
Abstract: Large systems of linear equations arise
in a number of scientific and engineering applications. In this paper we
describe the implementation of a family of disk based linear equation solvers
and the required characteristics of the I/O system needed to support them.
Keywords: parallel I/O, scientific computing,
matrix factorization, Intel, pario-bib
Comment: Invited speaker. See also scott:solvers.
This gives a very brief overview of Intel's block solver and slab solver,
both out-of-core linear-systems solvers. He notes a few optimizations that
had to be made to CFS to make it work: data and metadata needed to have equal
priority in the cache, because often the (higher-priority) metadata was
crowding out the data; and they had to restrict some files to small subsets
of disks to reduce the contention for the cache at each I/O node caused by
large groups of processors all requesting at the same time (see nitzberg:cfs
for the same problem).
Keywords: parallel I/O, scientific computing,
Intel, pario-bib
Comment: He discusses ProSolver-DES, which factors
large matrices by swapping square submatrices in and out of memory, and
Intel's new solver, which swaps column blocks in and out. The new solver is a
little slower, but allows full pivoting, which is needed for stability in
some matrices. A short paper with little detail. Some performance numbers.
See scott:matrix.
Keywords: parallel I/O, pario-bib
Comment: ``This paper shows how compression can be
used to speed up parallel i/o of large arrays. The current version of the
paper focuses on improving write performance.'' They use chunked files like
in seamons:interface but before writing they compress each chunk on its
compute node, and after reading they decompress each chunk on its compute
node. Presumably this is only useful when you plan to read back whole chunks.
They find better performance for compressing in many cases, even when the
compression time dominates the I/O time, because it reduces the I/O time so
much. They found that the compression time and compression ratio can vary
widely from chunk to chunk, leading to a tremendous load imbalance that
unfortunately spoils some of the advantages if all compute nodes must wait
for the slowest to finish.
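The per-chunk compress-then-write idea can be sketched with zlib (this is not the paper's library); here each compute node would compress its own chunk and record the compressed size so the chunk can be decompressed when it is read back whole. Compile with -lz.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void) {
    /* A chunk of array data owned by this node (assumed contents). */
    size_t n = 1 << 20;
    double *chunk = calloc(n, sizeof(double));

    uLong  src_len = (uLong)(n * sizeof(double));
    uLongf dst_len = compressBound(src_len);
    Bytef *packed  = malloc(dst_len);

    if (compress(packed, &dst_len, (const Bytef *)chunk, src_len) != Z_OK)
        return 1;

    /* Write the compressed chunk; a small header records the compressed
       size so the chunk can be inflated when read back as a whole. */
    FILE *f = fopen("chunk0.z", "wb");
    fwrite(&dst_len, sizeof(dst_len), 1, f);
    fwrite(packed, 1, dst_len, f);
    fclose(f);

    printf("chunk: %lu bytes raw, %lu bytes compressed\n",
           (unsigned long)src_len, (unsigned long)dst_len);
    free(packed);
    free(chunk);
    return 0;
}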
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: ``This paper shows what large performance
gains can be made for parallel i/o of large arrays by using a carefully
implemented library interface for i/o that makes use of array chunking. For
example, the authors obtained a factor of 10 speedup in output of time step
data by using the natural array chunks of the problem decomposition as the
units of i/o on an Intel iPSC/860. The paper also presents results from
experiments with the use of chunking in checkpointing and restarts on
parallel architectures, and the use of chunking with memory-mapped data files
in visualization on sequential architectures.'' They describe a library that
supports chunked representations of matrices. That is, ways to checkpoint,
output, or input multidimensional matrices to files in a blocked rather than
row-major or column-major layout. This helps the file be more versatile for
reading in a variety of dimensions. Their experiments show good performance
improvements, although they only tried it for an application whose data set
in memory was already in a blocked distribution - I would guess that smaller
improvements might come from column- or row-oriented memory distributions.
Also, some of their performance improvement came from characteristics
specific to the Intel CFS file system, having to do with its IOP-cache
management policies. See also seamons:schemas and seamons:compressed.
Keywords: parallel I/O, collective I/O, pario-bib
Scientists with
high-performance computing needs are plagued by applications suffering poor
i/o performance and are burdened with the need to consider low-level physical
storage details of persistent arrays in order to reach acceptable i/o
performance levels, especially with existing parallel i/o facilities. The
Panda i/o library (URL http://bunny.cs.uiuc.edu/CADR/panda.html) serves as a
concrete example of a methodology for freeing application developers from
unnecessary storage details through high-level abstract interfaces and
providing them with increased performance and greater portability. Panda
addresses these problems by introducing high-level application program
interfaces for array i/o on both parallel and sequential machines, and by
developing an efficient commodity-parts-based implementation of those
interfaces across a variety of computer architectures. It is costly to build
a file system from scratch and we designed Panda to run on top of existing
commodity file systems such as AIX; excellent performance using this approach
implies immediate and broad applicability. High-level interfaces provide ease
of use, application portability, and, most importantly, allow plenty of
flexibility for an efficient underlying implementation. A high-level view of
an entire i/o operation, made possible with Panda's high level interfaces,
allows Panda to optimize reading and writing arrays to the host file system
on the i/o nodes using Panda's server-directed i/o architecture. Panda
focuses specifically on multidimensional arrays, the data type at the root of
i/o performance problems in scientific computing. The Panda i/o library
exhibits excellent performance on the NASA Ames NAS IBM SP2, attaining
83-98% of peak AIX performance on each i/o node in the experiments
described in this paper. We expect high-level interfaces such as Panda's to
become the interfaces of choice for scientific applications in the future. As
Panda can be easily added on top of existing parallel file systems and
ordinary file systems without changing them, Panda illustrates a way to
obtain cheap, fast, and easy-to-use i/o for high-performance scientific
applications. Abstract: This four-page paper, written for an
audience from the supercomputing/parallel i/o community, is a nice succinct
introduction to Panda. Abstract and summary:
Keywords: parallel I/O, scientific computing,
pario-bib
Comment: Just a short 4-page summary of the Panda
I/O library, including some brief performance results.
Abstract: We present the architecture and
implementation results for Panda 2.0, a library for input and output of
multidimensional arrays on parallel and sequential platforms. Panda achieves
remarkable performance levels on the IBM SP2, showing excellent scalability
as data size increases and as the number of nodes increases, and provides
throughputs close to the full capacity of the AIX file system on the SP2 we
used. We argue that this good performance can be traced to Panda's use of
server-directed i/o (a logical-level version of disk-directed i/o [Kotz94b])
to perform array i/o using sequential disk reads and writes, a very high
level interface for collective i/o requests, and built-in facilities for
arbitrary rearrangements of arrays during i/o. Other advantages of Panda's
approach are ease of use, easy application portability, and a reliance on
commodity system software.
Keywords: collective I/O, parallel I/O, pario-bib
Comment: This rewrite of Panda (see
seamons:interface) is in C++ and runs on the SP2. They provide simple ways to
declare the distribution of your array in memory and on disk, to form a list
of arrays to be output at each timestep or at each checkpoint, and then to
call for a timestep or checkpoint. Then they use something like disk-directed
I/O (kotz:jdiskdir) internally to accomplish the rearrangement and transfer
of data from compute nodes to I/O nodes. Note proceedings only on CD-ROM and
WWW.
Keywords: parallel I/O, scientific database,
scientific computing, pario-bib
Comment: ``This paper presents PANDA's high-level
interfaces for i/o operations, including checkpoint, restart, and time step
output, and explains the rationale behind them.'' Basically they provide a
bit of detail for the file formats they use in seamons:interface.
This thesis presents a high-level interface
for array i/o and three implementation architectures, embodied in the Panda
(Persistence AND Arrays) array i/o library. The high-level interface
contributes to application portability, by encapsulating unnecessary details
and being easy to use. Performance results using Panda demonstrate that an
i/o system can provide application programs with a high-level, portable,
easy-to-use interface for array i/o without sacrificing performance or
requiring custom system software; in fact, combining all these benefits may
only be possible through a high-level interface due to the great freedom and
flexibility a high-level interface provides for the underlying
implementation. The Panda server-directed i/o architecture is a prime
example of an efficient implementation of collective array i/o for closely
synchronized applications in distributed-memory single-program multiple-data
(SPMD) environments. A high-level interface is instrumental to the good
performance of server-directed i/o, since it provides a global view of an
upcoming collective i/o operation that Panda uses to plan sequential reads
and writes. Performance results show that with server-directed i/o, Panda
achieves throughputs close to the maximum AIX file system throughput on the
i/o nodes of the IBM SP2 when reading and writing large multidimensional
arrays. Abstract: Multidimensional arrays are a
fundamental data type in scientific computing and are used extensively across
a broad range of applications. Often these arrays are persistent, i.e., they
outlive the invocation of the program that created them. Portability and
performance with respect to input and output (i/o) pose significant
challenges to applications accessing large persistent arrays, especially in
distributed-memory environments. A significant number of scientific
applications perform conceptually simple array i/o operations, such as
reading or writing a subarray, an entire array, or a list of arrays. However,
the algorithms to perform these operations efficiently on a given platform
may be complex and non-portable, and may require costly customizations to
operating system software.
Keywords: parallel I/O, persistent data, parallel
computing, pario-bib
Comment: see also chen:panda, seamons:panda,
seamons:compressed, seamons:interface, seamons:schemas, seamons:msio,
seamons:jpanda
Abstract: This paper discusses the design and
implementation of a cluster file system, called PVFS-PM, on the SCore cluster
system software. This is the first attempt to implement a cluster file system
on the SCore system. It is based on the PVFS cluster file system but replaces
TCP with the PMv2 communication library supported by SCore to provide a
scalable, high-performance cluster file system. PVFS-PM improves the
performance by factors of 1.07 and 1.93 for writing and reading, respectively,
with 8 I/O nodes, compared with the original PVFS on TCP on a Gigabit
Ethernet-connected SCore cluster.
Keywords: parallel I/O, pario-bib
Abstract: Parallel disks provide a cost effective
way of speeding up I/Os in applications that work with large amounts of data.
The main challenge is to achieve as much parallelism as possible, using
prefetching to avoid bottlenecks in disk access. Efficient algorithms have
been developed for some particular patterns of accessing the disk blocks. In
this paper, we consider general request sequences. When the request sequence
consists of unique block requests, the problem is called prefetching and is a
well-solved problem for arbitrary request sequences. When the reference
sequence can have repeated references to the same block, we need to devise an
effective caching policy as well. While optimum offline algorithms have been
recently designed for the problem, in the online case, no effective algorithm
was previously known. Our main contribution is a deterministic online
algorithm threshold-LRU which achieves an $O((MD/L)^{2/3})$ competitive
ratio and a randomized online algorithm threshold-MARK which achieves an
$O(\sqrt{MD/L}\,\log(MD/L))$ competitive ratio for the caching/prefetching
problem on the parallel disk model (PDM), where $D$ is the number of disks,
$M$ is the size of the fast memory buffer, and $M + L$ is the amount of
lookahead available in the request sequence. The best-known lower bound on
the competitive ratio is $\Omega(\sqrt{MD/L})$ for lookahead $L \ge M$ in
both models. We also show that if the deterministic online algorithm is
allowed to have twice the memory of the offline algorithm, then a tight
competitive ratio of $\Theta(\sqrt{MD/L})$ can be achieved. This problem
generalizes the well-known paging
problem on a single disk to the parallel disk model.
Keywords: online algorithms, prefetching, caching,
parallel disk model, threshold LRU, pario-bib
Abstract: Effective high-level data management is
becoming an important issue with more and more scientific applications
manipulating huge amounts of secondary-storage and tertiary-storage data
using parallel processors. A major problem facing the current solutions to
this data management problem is that these solutions either require a deep
understanding of specific data storage architectures and file layouts to
obtain the best performance (as in high-performance storage management
systems and parallel file systems), or they sacrifice significant performance
in exchange for ease-of-use and portability (as in traditional database
management systems). We discuss the design, implementation, and evaluation of
a novel application development environment for scientific computations. This
environment includes a number of components that make it easy for the
programmers to code and run their applications without much programming
effort and, at the same time, to harness the available computational and
storage power on parallel architectures. (39 refs.)
Keywords: data management, scientific
applications, workflow, parallel file systems, pario-bib
Abstract: One of the challenges brought by
large-scale scientific applications is how to avoid remote storage access by
collectively using sufficient local storage resources to hold huge amounts of
data generated by the simulation while providing high-performance I/O. DPFS,
a distributed parallel file system, is designed and implemented to address
this problem. DPFS collects locally distributed and unused storage resources
as a supplement to the internal storage of parallel computing systems to
satisfy the storage capacity requirement of large-scale applications. In
addition, like parallel file systems, DPFS provides striping mechanisms that
divide a file into small pieces and distributes them across multiple storage
devices for parallel data access. The unique feature of DPFS is that it
provides three file levels with each file level corresponding to a file
striping method. In addition to the traditional linear striping method, DPFS
also provides a novel Multidimensional striping method that can solve
performance problems of linear striping for many popular access patterns.
Other issues such as load-balancing and user interface are also addressed in
DPFS. (C) 2004 Elsevier Inc. All rights reserved.
Keywords: distributed file system, parallel file
system, striping, pario-bib
Abstract: While the storage market grows rapidly,
software RAID, as a low-cost solution, becomes more and more important
nowadays. However the performance of software RAID is greatly constrained by
its implementation. Varies methods have been taken to improve its
performance. By integrating a novel buffer mechanism - DMA aligned buffer
(DAB) into software RAID kernel driver, we achieved a significant performance
improvement, especially on small I/O requests.
Keywords: DMA, software RAID, performance, DMA
aligned buffer, DAB, pario-bib
Keywords: distributed shared memory, parallel I/O,
file I/O, file system, virtual memory, pario-bib
Comment: A parallel-I/O scheme for a system using
DSM, which has one disk per node. The file is initially placed on node 0.
The application runs once, and the system then collects information about the
access pattern. The file is then redistributed across all disks. The
application must do all file accesses from node 0, but in subsequent runs this
causes each block to be read
from its disk into the local memory of the attached node, and VM-mapped into
the correct place. Later page faults will move the data to the node needing
the data first (if the redistribution is done well, that's the same node, so
no movement is needed). At the end of the program, output data are written to
the output file, on the local disk. Thus: input files go to node 0 on the
first run, then are redistributed before second run, and output files are
created across all nodes but are written only at file close and only to the
closest disk. Limitations: files must be wholly read during application
initialization, from node 0. Files must be wholly written out during the
application completion. Files are immutable. You must have one slow run
initially. Input files must fit on one disk. I read sections 1-2, then
skimmed the rest.
Keywords: parallel I/O, multiprocessor
architecture, MIMD, fault tolerance, pario-bib
Comment: HARTS is a multicomputer connected with a
wrapped hexagonal mesh, with an emphasis on real-time and fault tolerance.
The mesh consists of network routing chips. Hanging off each is a small
bus-based multiprocessor ``node''. They consider how to integrate I/O devices
into this architecture: attach device controllers to processors, to network
routers, to node busses, or via a separate network. They decided to
compromise and hang each I/O controller off three network routers, in the
triangles of the hexagonal mesh. This keeps the traffic off of the node
busses, and allows multiple paths to each controller. They discuss the
reachability and hop count in the presence of failed nodes and links.
Keywords: file system, parallel I/O, pario-bib,
RAID
Comment: This is a file system based on LFS and
run on the RAID-II prototype (see drapeau:raid-ii). It uses the RAID-II
controller's memory (32 MB) to pipeline data transfers from the RAID disks
directly to (from) the network. Thus, data never flows through the server CPU
or memory. The server remains in control, telling the controller where each
block goes, etc. They get very high data rates. And despite being much faster
than the RAID for small writes, they were still CPU-limited, because the CPU
had to handle all the little requests.
Keywords: parallel I/O, database, pario-bib
Comment: Part of a special issue.
Abstract: Current APIs for multiprocessor
multi-disk file systems are not easy to use in developing out-of-core
algorithms that choreograph parallel data accesses. Consequently, the
efficiency of these algorithms is hard to achieve in practice. We address
this deficiency by specifying an API that includes data-access primitives for
data choreography. With our API, the programmer can easily access specific
blocks from each disk in a single operation, thereby fully utilizing the
parallelism of the underlying storage system. Our API supports the
development of libraries of commonly-used higher-level routines such as
matrix-matrix addition, matrix-matrix multiplication, and BMMC
(bit-matrix-multiply/complement) permutations. We illustrate our API in
implementations of these three high-level routines to demonstrate how easy it
is to use.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: Also published as Courant Institute Tech
Report 708.
This paper addresses this lack of
understanding by presenting an introduction to the data-transfer models on
which most of the out-of-core parallel-I/O algorithms are based, with
particular emphasis on the Parallel Disk Model. Sample algorithms are
discussed to demonstrate the paradigms (algorithmic techniques) used with
these models. Our aim is to provide insight into both the paradigms and
the particular algorithms described, thereby also providing a background for
understanding a range of related solutions. It is hoped that this background
would enable the appropriate selection of existing algorithms and the
development of new ones for current and future out-of-core problems.
Abstract: Problems whose data are too large to fit
into main memory are called out-of-core problems. Out-of-core
parallel-I/O algorithms can handle much larger problems than in-memory
variants and have much better performance than single-device variants.
However, they are not commonly used, partly because the understanding of
them is not widespread. Yet such algorithms ought to be growing in importance
because they address the needs of users with ever-growing problem sizes and
ever-increasing performance needs.
Keywords: parallel I/O algorithms, out-of-core,
pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: We estimate the performance of a
network-wide concurrent file system implemented using conventional disks as
disk arrays. Tests were carried out on both single system and network-wide
environments. On single systems, a file was split across several disks to
test the performance of file I/O operations. We concluded that performance
was proportional to the number of disks, up to four, on a system with high
computing power. Performance of a system with low computing power, however,
did not increase, even with more than two disks. When we split a file across
disks in a network-wide system called the Network-wide Concurrent File System
(N-CFS), we found performance similar to or slightly higher than that of disk
arrays on single systems. Since file access through N-CFS is transparent,
this system enables traditional disks on single and networked systems to be
used as disk arrays for I/O intensive jobs.
Keywords: mass storage, cluster computing,
distributed file system, parallel I/O, pario-bib
Keywords: disk controller, RAID, parallel I/O,
pario-bib
Comment: Describes the RAID controller for the DEC
StorageWorks product.
Abstract: Although there are several extant
studies of parallel scientific application request patterns, there is little
experimental data on the correlation of physical input/output patterns with
application input/output stimuli. To understand these correlations, we have
instrumented the SCSI device drivers of the Intel Paragon OSF/1 operating
system to record key physical input/output activities and have correlated
this data with the input/output patterns of scientific applications captured
via the Pablo analysis toolkit. Our analysis shows that disk hardware
features profoundly affect the distribution of request delays and that
current parallel file systems respond to parallel application input/output
patterns in non-scalable ways.
Keywords: parallel I/O application, pario-bib
Comment: In a Special Issue on I/O in Parallel
Applications, volume 12, numbers 3 and 4.
Keywords: adaptive striping, disk striping,
parallel I/O, pario-bib
Keywords: parallel I/O, pario-bib
Comment: They study the performance of a parallel
I/O system when several concurrent processes are accessing a shared set of
disks, using a common buffer pool. They found that under certain
circumstances the system can become unstable, in that some subset of
processes monopolize all of the resources, bringing the others to a virtual
halt. They use analytical models to show that instability can occur if every
process has distinct input and output disks, reads are faster than writes,
the disk-scheduling policy belongs to a certain class, and processes don't
wait for other resources.
Abstract: In a shared-disk parallel I/O system,
several processes may be accessing the disks concurrently. An important
example is concurrent external merging arising in database management systems
with multiple independent sort queries. Such a system may exhibit
instability, with one of the processes racing ahead of the others and
monopolizing I/O resources. This race can lead to serialization of the
processes and poor disk utilization, even when the static load on the disks
is balanced. The phenomenon can be avoided by proper layout of data on the
disks, as well as through other I/O management strategies. This has
implications for both data placement in multiple disk systems and task
partitioning for parallel processing.
Keywords: parallel I/O, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel I/O, pario-bib
Comment: Several external merges (many sorted runs
into one) are concurrently in action. Where do you put their input and output
runs, that is, on which disks? Only input runs are striped, and usually on a
subset of disks.
Keywords: parallel I/O, pario-bib
Comment: They describe a prefetching scheme where
hints can be provided from the programmer, compiler, or runtime library to
the I/O node. These hints seem to take the form of a sequence (all in order)
or a set (only one of many, from conditional expressions). The hints come
from each process, not collectively. Then, the I/O node keeps these
specifications and uses them to drive prefetching when there is no other work
to do. They rotate among the specifications of many processes. Later they
hope to examine more complex scheduling strategies and buffer-space
allocation strategies.
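A hypothetical shape for these per-process hint specifications, and the idle-time rotation over them, might look like the sketch below; none of the names or structures come from the paper.

#include <stdio.h>

enum hint_kind { HINT_SEQUENCE, HINT_SET };

struct io_hint {
    enum hint_kind kind;   /* sequence: all blocks, in order; set: one of many */
    const long *blocks;    /* candidate block numbers */
    int nblocks;
    int next;              /* cursor advanced as blocks are prefetched */
};

int main(void) {
    long seq[] = {10, 11, 12, 13};
    long alt[] = {40, 87};           /* only one of these will be needed */
    struct io_hint hints[2] = {
        {HINT_SEQUENCE, seq, 4, 0},
        {HINT_SET,      alt, 2, 0},
    };

    /* Idle-time loop on the I/O node: rotate among the processes' hints. */
    for (int round = 0; round < 4; round++)
        for (int p = 0; p < 2; p++) {
            struct io_hint *h = &hints[p];
            if (h->next < h->nblocks)
                printf("prefetch block %ld for process %d (%s hint)\n",
                       h->blocks[h->next++], p,
                       h->kind == HINT_SEQUENCE ? "sequence" : "set");
        }
    return 0;
}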
Abstract: We present the design, implementation,
and evaluation of D-GRAID, a gracefully-degrading and quickly-recovering RAID
storage array. D-GRAID ensures that most files within the file system remain
available even when an unexpectedly high number of faults occur. D-GRAID also
recovers from failures quickly, restoring only live file system data to a hot
spare. Both graceful degradation and live-block recovery are implemented in a
prototype SCSI-based storage system underneath unmodified file systems,
demonstrating that powerful "file-system like" functionality can be
implemented behind a narrow block-based interface.
Keywords: fault tolerance, disk failure, RAID,
D-GRAID, pario-bib
Comment: Awarded best student paper.
Keywords: I/O, workload characterization,
scientific computing, parallel I/O, pario-bib
Comment: Part of jin:io-book, modified from
smirni:evolutionary.
Abstract: The modest I/O configurations and file
system limitations of many current high-performance systems preclude solution
of problems with large I/O needs. I/O hardware and file system parallelism is
the key to achieving high performance. We analyze the I/O behavior of several
versions of two scientific applications on the Intel Paragon XP/S. The
versions involve incremental application code enhancements across multiple
releases of the operating system. Studying the evolution of I/O access
patterns underscores the interplay between application access patterns and
file system features. Our results show that both small and large request
sizes are common, that at present, application developers must manually
aggregate small requests to obtain high disk transfer rates, that concurrent
file accesses are frequent, and that appropriate matching of the application
access pattern and the file system access mode can significantly increase
application I/O performance. Based on these results, we describe a set of
file system design principles.
Keywords: I/O, workload characterization,
scientific computing, parallel I/O, pario-bib
Comment: They study two applications over several
versions, using Pablo to capture the I/O activity. They thus watch as
application developers improve the applications' use of I/O modes and request
sizes. Both applications move through three phases: initialization,
computation (with out-of-core I/O or checkpointing I/O), and output. They
found it necessary to tune the I/O request sizes to match the parameters of
the I/O system. In the initial versions, the code used small read and write
requests, which were (according to the developers) the "easiest and most
natural implementation for their I/O." They restructured the I/O to make
bigger requests, which better matched the capabilities of Intel PFS. They
conclude that asynchronous and collective operations are imperative. They
would like to see a file system that can adapt dynamically to adjust its
policies to the apparent access patterns. Automatic request aggregation of
some kind seems like a good idea; of course, that is one feature of a buffer
cache.
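The manual aggregation the developers performed is easy to picture. The
following is an editorial sketch (not code from the paper), assuming the
small requests are sequential in the file: small writes are staged in a
buffer and flushed as one large request sized to match the file system, which
is essentially what a write-behind buffer cache would do automatically.

    /* Illustrative write aggregation: stage small writes, flush one big one. */
    #include <string.h>
    #include <unistd.h>

    #define AGG_SIZE (4 * 1024 * 1024)  /* assumed sweet spot for the file system */

    static char   agg_buf[AGG_SIZE];
    static size_t agg_used = 0;

    /* Flush the staged data with a single large write(). */
    static void agg_flush(int fd)
    {
        size_t done = 0;
        while (done < agg_used) {
            ssize_t n = write(fd, agg_buf + done, agg_used - done);
            if (n <= 0) break;          /* error handling elided in this sketch */
            done += (size_t)n;
        }
        agg_used = 0;
    }

    /* Replace many small write() calls with calls to agg_write(). */
    void agg_write(int fd, const void *data, size_t len)
    {
        if (agg_used + len > AGG_SIZE)
            agg_flush(fd);
        if (len >= AGG_SIZE) {          /* already large: no need to stage it */
            write(fd, data, len);       /* short-write handling elided */
            return;
        }
        memcpy(agg_buf + agg_used, data, len);
        agg_used += len;
    }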
Abstract: As both processor and interprocessor
communication hardware is evolving rapidly with only moderate improvements to
file system performance in parallel systems, it is becoming increasingly
difficult to provide sufficient input/output (I/O) performance to parallel
applications. I/O hardware and file system parallelism are the key to
bridging this performance gap. Prerequisite to the development of efficient
parallel file systems is the detailed characterization of the I/O demands of
parallel applications. In the paper, we present a comparative study of
parallel I/O access patterns, commonly found in I/O intensive scientific
applications. The Pablo performance analysis tool and its I/O extensions is a
valuable resource in capturing and analyzing the I/O access attributes and
their interactions with extant parallel I/O systems. This analysis is
instrumental in guiding the development of new application programming
interfaces (APIs) for parallel file systems and effective file system
policies that respond to complex application I/O requirements.
Keywords: workload characterization, parallel I/O,
scientific applications, pario-bib
The results showed that
applications use a combination of both sequential and interleaved access
patterns, which shows that there is a clear need for a more complex API than
what is given by the standard UNIX API. In addition, when applications
required concurrent accesses, they commonly channeled all I/O requests
through a single node. Some form of collective I/O would have helped in these
cases. They also observed that despite the existence of several parallel I/O
APIs, programmers of scientific applications preferred to use standard Unix.
This is mostly due to the lack of an established portable
standard. Their study was "instrumental in the design and implementation of
MPI-IO". Their section on emerging I/O APIs is particularly interesting.
They comment that "the diversity of I/O request sizes and patterns suggests
that achieving high performance is unlikely with a single file system
policy." Their solution is to have a file system in which the user can give
"hints" to the file system expressing expected access patterns or to have a
file system that automatically classifies access patterns. The file system
can then choose policies to deal with the access patterns.
Comment: This paper compares the I/O performance
of five scientific applications from the scalable I/O initiative (SIO) suite
of applications. Their goals are to collect detailed performance data on
application characteristics and access patterns and to use that information
to design and evaluate parallel file system policies and parallel file system
APIs. The related work section gives a nice overview of recent I/O
characterization studies. They use the Pablo (reed:pablo) performance
analysis environment to analyze the performance of their five applications.
The applications they chose to evaluate include: MESSKIT and NWChem, two
implementations of the Hartree-Fock method for computational chemistry
applications; QCRD, a quantum chemical reaction dynamics application; PRISM,
a parallel 3D numerical simulation of the Navier-Stokes equations that models
high speed turbulent flow that is periodic in one direction; ECAT, a parallel
implementation of the Schwinger multichannel method used to calculate
low-energy electron molecule collisions.
Abstract: The broadening disparity in the
performance of input/output (I/O) devices and the performance of processors
and communication links on parallel systems is a major obstacle to achieving
high performance for a wide range of parallel applications. I/O hardware and
file system parallelism are the keys to bridging this performance gap. A
prerequisite to the development of efficient parallel file systems is
detailed characterization of the I/O demands of parallel applications. In
this paper, we present a comparative study of the I/O access patterns
commonly found in I/O intensive parallel applications. Using the Pablo
performance analysis environment and its I/O extensions we captured
application I/O access patterns and analyzed their interactions with current
parallel I/O systems. This analysis has proven instrumental in guiding the
development of new application programming interfaces (APIs) for parallel
file systems and in developing effective file system policies that can
adaptively respond to complex application I/O requirements.
Keywords: parallel I/O, pario-bib
Comment: see smirni:lessons
Keywords: I/O architecture, historical summary,
pario-bib
Comment: Classifies I/O systems by how they
initiate and terminate I/O, covering both uniprocessor and multiprocessor systems.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: An outline of two possible ways to
specify mappings of arrays to storage nodes in a multiprocessor, and to make
unformatted parallel transfers of multiple records. Seems to apply only to
arrays, and to files that hold only arrays. It keeps the linear structure of
files as sequences of records, but in some cases does not preserve the order
of data items or of fields within subrecords. Tricky to understand unless you
know HPF and Fortran 90.
Abstract: This paper presents the design and
implementation of a mobile storage system called a PersonalRAID. PersonalRAID
manages a number of disconnected storage devices. At the heart of a
PersonalRAID system is a mobile storage device that transparently propagates
data to ensure eventual consistency. Using this mobile device, a PersonalRAID
provides the abstraction of a single coherent storage name space that is
available everywhere, and it ensures reliability by maintaining data
redundancy on a number of storage devices. One central aspect of the
PersonalRAID design is that the entire storage system consists solely of a
collection of storage logs; the log-structured design not only provides an
efficient means for update propagation, but also allows efficient direct I/O
accesses to the logs without incurring unnecessary log replay delays. The
PersonalRAID prototype demonstrates that the system provides the desired
transparency and reliability functionalities without imposing any serious
performance penalty on a mobile storage user.
Keywords: file systems, pario-bib
Abstract: This paper investigates the performance
of a multi-disk storage system equipped with a segmented disk cache
processing a workload of multiple relational scans. Prefetching is a popular
method of improving the performance of scans. Many modern disks have a
multisegment cache which can be used for prefetching. We observe that,
exploiting declustering as a data placement method, prefetching in a
segmented cache causes a load imbalance among several disks. A single disk
becomes a bottleneck, degrading performance of the entire system. A variation
in disk queue length is a primary factor of the imbalance. Using a precise
simulation model, we investigate several approaches to achieving better
balancing. Our metrics are a scan response time for the closed-end system and
an ability to sustain a workload without saturating for the open-end system.
We arrive at two main conclusions: (1) Prefetching in main memory is
inexpensive and effective for balancing and can supplement or substitute
prefetching in disk cache. (2) Disk-level prefetching provides about the same
performance as main memory prefetching if request queues are managed in the
disk controllers rather than in the host. Checking the disk cache before
queuing requests provides not only better request response time but also
drastically improves balancing. A single cache performs better than a
segmented cache for this method.
Keywords: parallel I/O, prefetching, disk cache,
disk array, pario-bib
Comment: An interesting paper about
disk-controller cache management in database workloads. Actually, the
workloads are sequential scans of partitioned files, which could occur in
many kinds of workloads. The declustering pattern (partitioning) is a little
unusual for most scientific parallel I/O veterans, who are used to striping.
And the cache-management algorithms seem a bit strange, particularly the fact
that the cache appears to be used only for explicit prefetch requests. Turns
out that it is best to put the prefetching and disk queueing in the same
place, either on the controller or in main memory, to avoid load imbalance
that arises from randomness in the workload, which is accentuated into a big
bottleneck and a convoy effect.
Keywords: distributed file system, data storage,
mass storage, network-attached disks, disk striping, parallel I/O, pario-bib
Comment: see also preslan:gfs
Keywords: disk mirroring, parallel I/O, pario-bib
Comment: Write one disk (the master) in the usual
way, and write the slave disk at the closest free block. Actually, they
propose to logically partition the two disks so that each disk has a master
partition and a slave partition. Up to 80% improvement in small-write
performance, while retaining good sequential read performance.
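As a rough editorial sketch of the idea (hypothetical names, and simplified to
one master disk and one slave disk rather than master and slave partitions):
the master copy is written in place, while the slave copy goes to whatever
free block is nearest the slave's current head position, with a map recording
where each logical block landed.

    /* Sketch of the write policy described above.  Disk 0 is the master,
     * disk 1 the slave; disk_write() stands in for a real device request. */
    #define NBLOCKS 100000L

    static long slave_map[NBLOCKS];   /* logical block -> slave physical block */
    static char slave_used[NBLOCKS];  /* nonzero once a slave block is taken */
    static long slave_head = 0;       /* last slave block written */

    static void disk_write(int disk, long block, const void *data)
    { (void)disk; (void)block; (void)data; }

    void mirrored_write(long logical_block, const void *data)
    {
        /* Master copy: written in place, preserving the sequential layout. */
        disk_write(0, logical_block, data);

        /* Slave copy: use the free block closest to the slave head. */
        long best = -1;
        for (long d = 0; d < NBLOCKS && best < 0; d++) {
            if (slave_head + d < NBLOCKS && !slave_used[slave_head + d])
                best = slave_head + d;
            else if (slave_head - d >= 0 && !slave_used[slave_head - d])
                best = slave_head - d;
        }
        if (best < 0) return;             /* slave full; reclamation not sketched */
        slave_used[best] = 1;
        slave_map[logical_block] = best;  /* remember where the copy landed */
        disk_write(1, best, data);
        slave_head = best;
    }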
Keywords: disk mirroring, parallel I/O, pario-bib
Comment: See solworth:mirror.
Abstract: Processing of data in many data analysis
applications can be represented as an acyclic, coarse grain data flow, from
data sources to the client. This paper is concerned with scheduling of
multiple data analysis operations, each of which is represented as a
pipelined chain of processing on data. We define the scheduling problem for
effectively placing components onto Grid resources, and propose two
scheduling algorithms. Experimental results are presented using a
visualization application.
Keywords: DataCutter, pipeline, dataflow,
pario-bib
Abstract: High performance servers and high-speed
networks will form the backbone of the infra-structure required for
distributed multimedia information systems. Given that the goal of such a
server is to support hundreds of interactive data streams simultaneously,
various tradeoffs are possible with respect to the storage of data on
secondary memory, and its retrieval therefrom. In this paper we identify and
evaluate these tradeoffs. We evaluate the effect of varying the stripe factor
and also the performance of batched retrieval of disk-resident data. We
develop a methodology to predict the stream capacity of such a server. The
evaluation is done for both uniform and skewed access patterns. Experimental
results on the Intel Paragon computer are presented.
Keywords: threads, parallel I/O, pario-bib
Keywords: parallel I/O, parallel file system, disk
mirroring, disk scheduling, pario-bib
Comment: Describes a simulation based on a model of
the disk access pattern. Multiple-disk system, much like in matloff:multidisk.
Files stored in two copies, each on a separate disk, but there are more than
two disks, so this differs from mirroring. He compares several disk
scheduling algorithms. A variant of SCAN seems to be the best.
Keywords: parallel computer architecture,
interconnection network, network interface, distributed memory, systolic
array, input/output, parallel I/O, pario-bib
Comment: See also steenkiste:interface,
kung:network, hemy:gigabit, bornstein:reshuffle, and gross:io.
Keywords: dictionary, survey, parallel I/O,
pario-bib
Comment: A tremendous resource.
Keywords: parallel I/O, RAID, redundancy,
reliability, disk array, pario-bib
Comment: Part of jin:io-book; reformatted version
of stodolsky:logging.
Keywords: parallel I/O, RAID, redundancy,
reliability, pario-bib
Comment: See stodolsky:logging. An in-between
version is CMU-CS-94-170, stodolsky:logging-tr.
Abstract: Parity encoded redundant disk arrays
provide highly reliable, cost effective secondary storage with high
performance for read accesses and large write accesses. Their performance on
small writes, however, is much worse than mirrored disks - the traditional,
highly reliable, but expensive organization for secondary storage.
Unfortunately, small writes are a substantial portion of the I/O workload of
many important, demanding applications such as on-line transaction
processing. This paper presents parity logging, a novel solution to the small
write problem for redundant disk arrays. Parity logging applies journalling
techniques to substantially reduce the cost of small writes. We provide a
detailed analysis of parity logging and competing schemes - mirroring,
floating storage, and RAID level 5 - and verify these models by simulation.
Parity logging provides performance competitive with mirroring, the best of
the alternative single failure tolerating disk array organizations. However,
its overhead cost is close to the minimum offered by RAID level 5. Finally,
parity logging can exploit data caching much more effectively than all three
alternative approaches.
Keywords: parallel I/O, RAID, redundancy,
reliability, disk array, pario-bib
Comment: Cite stodolsky:jlogging. Earlier version
is CMU-CS-93-200. Parity logging to improve small writes. Log all parity
updates; when the log fills, apply them to the parity disk. They actually
distribute the parity and the log across all disks. Performance is comparable
to, or exceeds, mirroring. They also discuss handling double failures.
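The core mechanism is easy to sketch. The following is an editorial
illustration (not the authors' code, with a toy in-memory stand-in for the
disks): a small write logs its parity update (old data XOR new data) instead
of doing an immediate read-modify-write of the parity block, and the log is
applied in bulk when it fills.

    /* Illustrative parity-logging sketch for small writes. */
    #include <string.h>

    #define BLOCK 512
    #define LOG_CAPACITY 128
    #define NBLOCKS 2048              /* toy in-memory "disk" for the sketch */

    static unsigned char store[NBLOCKS][BLOCK];

    static void read_block(long b, unsigned char *buf)        { memcpy(buf, store[b], BLOCK); }
    static void write_block(long b, const unsigned char *buf) { memcpy(store[b], buf, BLOCK); }
    static long parity_block_of(long b) { return 1024 + b % 1024; }  /* toy mapping */

    struct log_entry { long parity_block; unsigned char update[BLOCK]; };
    static struct log_entry plog[LOG_CAPACITY];
    static int log_used = 0;

    static void apply_log(void)
    {
        unsigned char parity[BLOCK];
        for (int i = 0; i < log_used; i++) {   /* a real implementation would
                                                  sort entries by parity block */
            read_block(plog[i].parity_block, parity);
            for (int j = 0; j < BLOCK; j++)
                parity[j] ^= plog[i].update[j];
            write_block(plog[i].parity_block, parity);
        }
        log_used = 0;
    }

    void small_write(long data_block, const unsigned char *new_data)
    {
        unsigned char old_data[BLOCK];
        read_block(data_block, old_data);
        write_block(data_block, new_data);

        /* Log the parity update rather than updating parity in place. */
        struct log_entry *e = &plog[log_used++];
        e->parity_block = parity_block_of(data_block);
        for (int j = 0; j < BLOCK; j++)
            e->update[j] = old_data[j] ^ new_data[j];
        if (log_used == LOG_CAPACITY)
            apply_log();
    }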
Abstract: Parity encoded redundant disk arrays
provide highly reliable, cost effective secondary storage with high
performance for reads and large writes. Their performance on small writes,
however, is much worse than mirrored disks - the traditional, highly
reliable, but expensive organization for secondary storage. Unfortunately,
small writes are a substantial portion of the I/O workload of many important,
demanding applications such as on-line transaction processing. This
paper presents parity logging, a novel solution to the small write problem
for redundant disk arrays. Parity logging applies journalling techniques to
substantially reduce the cost of small writes. We provide detailed models of
parity logging and competing schemes - mirroring, floating storage, and RAID
level 5 - and verify these models by simulation. Parity logging provides
performance competitive with mirroring, but with capacity overhead close to
the minimum offered by RAID level 5. Finally, parity logging can exploit data
caching more effectively than all three alternative approaches.
Keywords: parallel I/O, disk array, RAID,
redundancy, reliability, pario-bib
Keywords: parallel I/O, database, SIMD, connection
machine, pario-bib
Comment: See also IEEE Computer, Jan 1988, p. 8
and 10. Examines a database query that is parallelized for the Connection
Machine. He shows that in many cases, a smarter serial algorithm that reads
only a portion of the database (through an index) will be faster than 64K
processors reading the whole database. Uses a simple model for the machines
to show this. Reemphasizes the point of Boral and DeWitt that I/O is the
bottleneck of a database machine, and that parallelizing the processing will
not necessarily help a great deal.
Keywords: disk striping, reliability, pario-bib
Comment: Part of jin:io-book; reformatted version
of stonebraker:radd.
Keywords: disk striping, reliability, pario-bib
Comment: This is about ``RADD'', a distributed
form of RAID. Meant for cases where the disks are physically distributed
around several sites, and no one controller controls them all. Much lower
space overhead than any mirroring technique, with comparable normal-mode
performance at the expense of failure-mode performance.
Keywords: parallel I/O, disk array, RAID, Sprite,
disk architecture, database, pario-bib
Comment: Designing a DBMS for Sprite and RAID.
High availability, high performance. Shared memory multiprocessor. Allocates
extents to files that are interleaved over a variable number of disks, and
over a contiguous set of tracks on those disks.
This
thesis presents an efficient and portable implementation of the Panda array
I/O library. In this implementation, standard software components are used to
build the I/O library to aid its portability. The implementation also
provides a simple, flexible framework for the implementation and integration
of the various collective I/O strategies. The server directed I/O and the
reduced messages server directed I/O algorithms are implemented in the Panda
array I/O library. This implementation supports the sharing of the I/O
servers between multiple applications by extending the collective I/O
strategies. Also, the implementation supports the use of part time I/O nodes
where certain designated compute nodes act as the I/O servers during the I/O
phase of the application. The performance of this implementation of the Panda
array I/O library is measured on the IBM SP2 and the performance results show
that for read and write operations, the collective I/O strategies used by the
Panda array I/O library achieve throughputs close to the maximum throughputs
provided by the underlying file system on each I/O node of the IBM SP2.
Abstract: Parallel computers are a cost effective
approach to providing significant computational resources to a broad range of
scientific and engineering applications. Due to the relatively lower
performance of the I/O subsystems on these machines and due to the
significant I/O requirements of these applications, the I/O performance can
become a major bottleneck. Optimizing the I/O phase of these applications
poses a significant challenge. A large number of these scientific and
engineering applications perform simple operations on multidimensional arrays
and providing an easy and efficient mechanism for implementing these
operations is important. The Panda array I/O library provides simple high
level interfaces to specify collective I/O operations on multidimensional
arrays in a distributed memory single-program multiple-data (SPMD)
environment. The high level information provided by the user through these
interfaces allows the Panda array I/O library to produce an efficient
implementation of the collective I/O request. The use of these high level
interfaces also increases the portability of the application.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Abstract: Disk arrays are widely used in Storage
Area Networks (SANs) to achieve mass storage capacity and high level I/O
parallelism. Data partitioning and distribution among the disks is a
promising approach to minimize the file access time and balance the I/O
workload. But disk I/O parallelism by itself does not guarantee the optimal
performance of an application. The disk access rates fluctuate with time
because of access pattern variations, which leads to a workload imbalance.
The user access pattern prediction is of great importance to dynamic data
reorganization between hot and cool disks. Data migration occurs according to
current and future disk allocation states and access frequencies. The
objective of this paper is to develop a neural network based disk allocation
trend prediction method and optimize the disks' file capacity to their
balanced level. A Levenberg-Marquardt neural network was adopted to predict
the disk access frequencies from the I/O track history. Data reorganization
on disk arrays was optimized to provide a good workload balance. The
simulation results proved that the proposed method performs well.
Keywords: SAN, dynamic data reorganization, neural
network, access pattern prediction, pario-bib
Keywords: disk mirroring, parallel I/O, pario-bib
Comment: MetaDisk is an addition to the Sun
SPARCstation server kernel. It allows disk mirroring between any two local
disk partitions, or concatenation of several disk partitions into one larger
partition. Can span up to 4 partitions simultaneously. Appears not to be
striped, just allows bigger partitions, and (by chance) some parallel I/O for
large files.
Abstract: Disk array subsystems have serious
demands for higher speed and greater number of channels along with the trends
in improving operational efficiency of information system by integrating its
storage subsystems. Conventional disk array subsystem employs a
bus-structured connection between its microprocessors and shared cache and
control memories. In general, a network-structured connection can be faster
as compared with a bus-structured one although a switch causes higher
latency. In this paper we propose a hybrid star-net connection consisting of
a hierarchically switched star fan-out for cache memory and a direct star
fan-out for control memory, where cache is used as a temporary store of host
data, and control memory stores various control data including cache control
tables. The latter requires more speed than the former. Based on the proposed
connection, we developed a disk array subsystem with host interface having 32
channels, and evaluated its performance. We could attain sequential
performance of 920MB/s and transaction performance of 160KIO/s. In comparison
to the conventional disk array subsystem, the former is 5 times, and the
latter is 2.5 times better.
Keywords: disk array, star network topology,
shared cache, pario-bib
Keywords: parallel application, parallel I/O,
pario-bib
Comment: guest editorial, no abstract
Abstract: CPU speeds are increasing at a much
faster rate than secondary storage device speeds. Many important applications
face an I/O bottleneck. We demonstrate that this bottleneck can be alleviated
through 1) scalable striping of data and 2) caching/prefetching techniques.
This paper describes the design and performance of the Tower of Pizzas
(TOPs), a portable software system providing parallel I/O and buffering
services.
Keywords: parallel I/O, pario-bib
Comment: Same as CS-TR-3462 from Department of
Computer Science. Basically, a parallel file system for a workstation cluster
using the usual parallel file-system ideas. They do support client-side
caching, using a client-side server process which shares memory with the
client. Otherwise nothing really new.
Keywords: parallel I/O, multimedia, video on
demand, pario-bib
Comment: They describe a video server system being
developed at the Sarnoff Real Time Corporation. This paper describes their
simulated system. It is intended not only as a video-on-demand system, but
also for capture and processing as well as playback. So they have a complex
system of interconnected SIMD boards, each with a high-speed link to various
devices, including a collection of disk drives. Data is striped across disks.
They integrate playback scheduling and the disk striping in an interesting
way.
Keywords: debugging, visualization, parallel file
system, parallel I/O, pario-bib
Keywords: cluster, parallel I/O, pario-bib
Comment: Part of jin:io-book; reformatted version
of tewari:high.
Abstract: Clustered multimedia servers, consisting
of interconnected nodes and disks, have been proposed for large-scale servers
that are capable of supporting multiple concurrent streams which access the
video objects stored in the server. As the number of disks and nodes in the
cluster increases, so does the probability of a failure. With data striped
across all disks in a cluster, the failure of a single disk or node results
in the disruption of many or all streams in the system. Guaranteeing high
availability in such a cluster becomes a primary requirement to ensure
continuous service. In this paper, we study mirroring and software RAID
schemes with different placement strategies that guarantee high availability
in the event of disk and node failures while satisfying the real-time
requirements of the streams. We examine various declustering techniques for
spreading the redundant information across disks and nodes and show that
random declustering has good real-time performance. Finally, we compare the
overall cost per stream for different system configurations. We derive the
parameter space where mirroring and software RAID apply, and determine
optimal parity group sizes.
Keywords: cluster, parallel I/O, pario-bib
Abstract: In this paper, we propose a strategy for
implementing parallel-I/O interfaces portably and efficiently. We have
defined an abstract-device interface for parallel I/O, called ADIO. Any
parallel-I/O API can be implemented on multiple file systems by implementing
the API portably on top of ADIO, and implementing only ADIO on different file
systems. This approach simplifies the task of implementing an API and yet
exploits the specific high-performance features of individual file systems.
We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO
and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our
performance studies indicate that the overhead of using ADIO as an
implementation strategy is very low.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Keywords: multiprocessor file system interface,
parallel I/O, pario-bib
Comment: They propose an intermediate interface
that can serve as an implementation base for all parallel file-system APIs,
and which can itself be implemented on top of all parallel file systems. This
``universal'' interface allows all apps to run on all file systems with no
porting, and for people to experiment with different APIs.
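The layering is easy to picture; the sketch below is purely illustrative (the
structure and function names are hypothetical and are not ADIO's actual
interface): each file system supplies a small table of operations, and any
parallel I/O API is written once against that table rather than against each
file system.

    /* Hypothetical sketch of an abstract-device layer for parallel I/O. */
    #include <stddef.h>
    #include <sys/types.h>

    typedef struct adio_ops {
        int (*open_fn)  (const char *path, int flags, void **handle);
        int (*read_fn)  (void *handle, void *buf, size_t len, off_t offset);
        int (*write_fn) (void *handle, const void *buf, size_t len, off_t offset);
        int (*close_fn) (void *handle);
    } adio_ops;

    /* One table per underlying file system (UFS, PFS, PIOFS, ...). */
    extern const adio_ops ufs_ops;
    extern const adio_ops pfs_ops;

    /* A portable API (MPI-IO, the PFS interface, ...) is written once
     * against the table; only the tables are file-system specific. */
    struct api_file { const adio_ops *ops; void *handle; };

    int api_read_at(struct api_file *f, void *buf, size_t len, off_t offset)
    {
        return f->ops->read_fn(f->handle, buf, len, offset);
    }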
Abstract: Parallel computers are increasingly
being used to run large-scale applications that also have huge I/O
requirements. However, many applications obtain poor I/O performance on
modern parallel machines. This special issue of IJSA contains papers that
describe the I/O requirements and the techniques used to perform I/O in real
parallel applications. We first explain how the I/O application program
interface (API) plays a critical role in enabling such applications to
achieve high I/O performance. We describe how the commonly used Unix I/O
interface is inappropriate for parallel I/O and how an explicitly parallel
API with support for collective I/O can help the underlying I/O hardware and
software perform I/O efficiently. We then describe MPI-IO, a recently
defined, standard, portable API specifically designed for high-performance
parallel I/O. We conclude with an overview of the papers in this special
issue.
Keywords: parallel I/O application, pario-bib
Abstract: Many large-scale applications on
parallel machines are bottlenecked by the I/O performance rather than the CPU
or communication performance of the system. To improve the I/O performance,
it is first necessary for system designers to understand the I/O requirements
of various applications. This paper presents the results of a study of the
I/O characteristics and performance of a real, I/O-intensive, portable,
parallel application in astrophysics, on two different parallel
machines-the IBM SP and the Intel Paragon. We instrumented the source code
to record all I/O activity, and analyzed the resulting trace files. Our
results show that, for this application, the I/O consists of fairly large
writes, and writing data to files is faster on the Paragon, whereas opening
and closing files are faster on the SP. We also discuss how the I/O
performance of this application could be improved; particularly, we believe
that this application would benefit from using collective I/O.
Keywords: file access pattern, workload
characterization, parallel I/O, pario-bib
Comment: Adds another data point to the collection
of parallel scientific applications whose I/O has been characterized, a
collection started in earnest by crandall:iochar. It's a pretty
straightforward application; it just writes its matrices every few timesteps.
The application writes whole matrices; the OS sees request sizes that are
more a factor of the Chameleon library than of the application. Most of the
I/O itself is not implemented in parallel, because they used UniTree on the
SP, and because the Chameleon library sequentializes this kind of I/O through
one node. Other numbers from the paper don't add much insight into the
workload. Revised slightly in October 1995; the abstract represents that
revision.
Abstract: We present the results of an
experimental evaluation of the parallel I/O systems of the IBM SP and Intel
Paragon using a real three-dimensional parallel application code. This
application, developed by scientists at the University of Chicago, simulates
the gravitational collapse of self-gravitating gaseous clouds. It performs
parallel I/O by using library routines that we developed and optimized
separately for the SP and Paragon. The I/O routines perform two-phase I/O and
use the parallel file systems PIOFS on the SP and PFS on the Paragon. We
studied the I/O performance for two different sizes of the application. In
the small case, we found that I/O was much faster on the SP. In the large
case, open, close, and read operations were only slightly faster, and seeks
were significantly faster, on the SP; whereas, writes were slightly faster on
the Paragon. The communication required within our I/O routines was faster on
the Paragon in both cases. The highest read bandwidth obtained was
48 Mbytes/sec., and the highest write bandwidth obtained was
31.6 Mbytes/sec., both on the SP.
Keywords: parallel I/O, multiprocessor file
system, workload characterization, pario-bib
Abstract: This paper presents the results of an
experimental evaluation of the parallel I/O systems of the IBM SP and Intel
Paragon. For the evaluation, we used a full, three-dimensional application
code that is in production use for studying the nonlinear evolution of Jeans
instability in self-gravitating gaseous clouds. The application performs I/O
by using library routines that we developed and optimized separately for
parallel I/O on the SP and Paragon. The I/O routines perform two-phase I/O
and use the PIOFS file system on the SP and PFS on the Paragon. We studied
the I/O performance for two different sizes of the application. We found that
for the small case, I/O was faster on the SP, whereas for the large case, I/O
took almost the same time on both systems. Communication required for I/O was
faster on the Paragon in both cases. The highest read bandwidth obtained was
48 Mbytes/sec. and the highest write bandwidth obtained was 31.6 Mbytes/sec.,
both on the SP.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: This version no longer on the web.
Abstract: In out-of-core computations, data needs
to be moved back and forth between main memory and disks during program
execution. In this paper, we propose a technique called the Extended
Two-Phase Method, for accessing sections of out-of-core arrays efficiently.
This is an extension and generalization of the Two-Phase Method for reading
in-core arrays from files, which was previously proposed in
[Rosario93,Bordawekar93]. The Extended Two-Phase Method uses collective I/O
in which all processors cooperate to perform I/O in an efficient manner by
combining several I/O requests into fewer larger requests, eliminating
multiple disk accesses for the same data and reducing contention for disks.
We describe the algorithms for reading as well as writing array sections.
Performance results on the Intel Touchstone Delta for many different access
patterns are presented and analyzed. It is observed that the Extended
Two-Phase Method gives consistently good performance over a wide range of
access patterns.
Keywords: parallel I/O, pario-bib
Comment: Revised as thakur:ext2phase2 and
thakur:jext2phase.
Abstract: A number of applications on parallel
computers deal with very large data sets which cannot fit in main memory. In
such cases, data must be stored in files on disks and fetched into main
memory during program execution. In programs with large out-of-core arrays
stored in files, it is necessary to read/write smaller sections of the arrays
from/to files. This paper describes a method, called the extended
two-phase method, for accessing sections of out-of-core arrays in an
efficient manner. This method uses collective I/O in which processors
cooperate to combine several I/O requests into fewer larger granularity
requests, reorder requests so that the file is accessed in proper sequence,
and eliminate simultaneous I/O requests for the same data. The I/O workload
is divided among processors dynamically, depending on the access requests. We
present performance results for two real, out-of-core, parallel applications
- matrix multiplication and a Laplace's equation solver - and several
synthetic access patterns. The results indicate that the extended two-phase
method provides a significant performance improvement over a direct method
for I/O.
Keywords: parallel I/O, pario-bib
Comment: Revised version of thakur:ext2phase. The
tech report was itself revised in November 1995; the abstract represents that
revision.
Abstract: A number of applications on parallel
computers deal with very large data sets that cannot fit in main memory. In
such applications, data must be stored in files on disks and fetched into
memory during program execution. Parallel programs with large out-of-core
arrays stored in files must read/write smaller sections of the arrays from/to
files. In this article, we describe a method for accessing sections of
out-of-core arrays efficiently. Our method, the extended two-phase method,
uses collective I/O: Processors cooperate to combine several I/O requests
into fewer larger granularity requests, reorder requests so that the file is
accessed in proper sequence, and eliminate simultaneous I/O requests for the
same data. In addition, the I/O workload is divided among processors
dynamically, depending on the access requests. We present performance results
obtained from two real out-of-core parallel applications-matrix
multiplication and a Laplace's equation solver-and several synthetic access
patterns, all on the Intel Touchstone Delta. These results indicate that the
extended two-phase method significantly outperformed a direct (noncollective)
method for accessing out-of-core array sections.
Keywords: parallel I/O, pario-bib
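A much-simplified editorial sketch of the two-phase idea (not the PASSION or
ROMIO code) is shown below for one concrete case: a row-major R x C array of
doubles in a file, with each of P processes wanting a block of C/P columns.
Each process first reads one contiguous band of rows, then an all-to-all
exchange delivers each column block to its owner, replacing many small
strided reads with one large read plus communication.

    /* Simplified two-phase collective read.  Assumes P divides R and C.
     * Phase 1: each process reads a contiguous band of R/P rows.
     * Phase 2: an all-to-all exchange delivers each column block to its owner. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void two_phase_read(MPI_File fh, int R, int C, double *mycols /* R x C/P */)
    {
        int rank, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        int rows = R / P, cols = C / P;

        /* Phase 1: one large contiguous read of this process's row band. */
        double *band = malloc((size_t)rows * C * sizeof(double));
        MPI_Offset off = (MPI_Offset)rank * rows * C * sizeof(double);
        MPI_File_read_at(fh, off, band, rows * C, MPI_DOUBLE, MPI_STATUS_IGNORE);

        /* Pack the band so the columns are grouped by destination process. */
        double *sendbuf = malloc((size_t)rows * C * sizeof(double));
        double *recvbuf = malloc((size_t)rows * C * sizeof(double));
        for (int q = 0; q < P; q++)
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    sendbuf[(q * rows + i) * cols + j] = band[i * C + q * cols + j];

        /* Phase 2: redistribute so each process gets its columns of every row. */
        MPI_Alltoall(sendbuf, rows * cols, MPI_DOUBLE,
                     recvbuf, rows * cols, MPI_DOUBLE, MPI_COMM_WORLD);

        /* recvbuf arrives already ordered by (global row, local column). */
        memcpy(mycols, recvbuf, (size_t)R * cols * sizeof(double));

        free(band); free(sendbuf); free(recvbuf);
    }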
To improve the I/O performance of parallel programs with distributed
multidimensional arrays, we have developed a software library called Passion
(Parallel, Scalable Software for Input/Output). Passion's routines are
designed to read or write either entire distributed arrays or sections of
such arrays. Passion also frees the programmer from many of the tedious tasks
associated with performing I/O in parallel programs and has a high-level
interface that makes it easy to specify the required I/O. We have
implemented Passion on Intel's Paragon, Touchstone Delta, and iPSC/860
systems, and on the IBM SP system. We have also made it publicly available
through the World Wide Web (http://www.cat.syr.edu/passion.html). We are in
the process of porting the library to other machines and extending its
functionality.
Abstract: Parallel computers with peak performance
of more than 100 Gflops/second are already available to solve a variety of
problems in a range of disciplines. However, the input/output performance of
these machines is a poor reflection of their true computational power.
Keywords: parallel I/O, pario-bib
Comment: See thakur:passion, choudhary:passion.
Abstract: MPI-IO, the I/O part of the MPI-2
standard, is a promising new interface for parallel I/O. A key feature of
MPI-IO is that it allows users to access several noncontiguous pieces of data
from a file with a single I/O function call by defining file views with
derived datatypes. We explain how critical this feature is for high
performance, why users must create and use derived datatypes whenever
possible, and how it enables implementations to perform optimizations. In
particular, we describe two optimizations our MPI-IO implementation, ROMIO,
performs: data sieving and collective I/O. We demonstrate the performance and
portability of the approach with performance results on five different
parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI
Origin2000.
Keywords: MPI, parallel I/O, pario-bib
In addition to basic I/O functionality, we consider the
issues of supporting other MPI-IO features, such as 64-bit file sizes,
noncontiguous accesses, collective I/O, asynchronous I/O, consistency and
atomicity semantics, user-supplied hints, shared file pointers, portable data
representation, and file preallocation. We describe how we implemented each
of these features on various machines and file systems. The machines we
consider are the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI
Origin2000, and networks of workstations; and the file systems we consider
are HP HFS, IBM PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix
file system (UFS). We also present our thoughts on how a file system can
be designed to better support MPI-IO. We provide a list of features desired
from a file system that would help in implementing MPI-IO correctly and with
high performance.
Abstract: We discuss the issues involved in
implementing MPI-IO portably on multiple machines and file systems and also
achieving high performance. One way to implement MPI-IO portably is to
implement it on top of the basic Unix I/O functions (open, lseek,
read, write, and close), which are themselves portable. We
argue that this approach has limitations in both functionality and
performance. We instead advocate an implementation approach that combines a
large portion of portable code and a small portion of code that is optimized
separately for different machines and file systems. We have used such an
approach to develop a high-performance, portable MPI-IO implementation,
called ROMIO.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
In addition to basic I/O
functionality, we consider the issues of supporting other MPI-IO features,
such as 64-bit file sizes, noncontiguous accesses, collective I/O,
asynchronous I/O, consistency and atomicity semantics, user-supplied hints,
shared file pointers, portable data representation, file preallocation, and
some miscellaneous features. We describe how we implemented each of these
features on various machines and file systems. The machines we consider are
the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI Origin2000, and
networks of workstations; and the file systems we consider are HP HFS, IBM
PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix file system
(UFS). We also present our thoughts on how a file system can be designed
to better support MPI-IO. We provide a list of features desired from a file
system that would help in implementing MPI-IO correctly and with high
performance.
Abstract: We discuss the issues involved in
implementing MPI-IO portably on multiple machines and file systems and also
achieving high performance. One way to implement MPI-IO portably is to
implement it on top of the basic Unix I/O functions (open, lseek, read,
write, and close), which are themselves portable. We argue that this approach
has limitations in both functionality and performance. We instead advocate an
implementation approach that combines a large portion of portable code and a
small portion of code that is optimized separately for different machines and
file systems. We have used such an approach to develop a high-performance,
portable MPI-IO implementation, called ROMIO.
Keywords: parallel I/O, multiprocessor file system
interface, pario-bib
Abstract: MPI-IO, the I/O part of the MPI-2
standard, is a promising new interface for parallel I/O. A key feature of
MPI-IO is that it allows users to access several noncontiguous pieces of data
from a file with a single I/O function call by defining file views with
derived datatypes. We explain how critical this feature is for high
performance, why users must create and use derived datatypes whenever
possible, and how it enables implementations to perform optimizations. In
particular, we describe two optimizations our MPI-IO implementation, ROMIO,
performs: data sieving and collective I/O. We present performance results on
five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC
SX-4, and SGI Origin2000.
Keywords: MPI, parallel I/O, pario-bib
Abstract: The I/O access patterns of many parallel
applications consist of accesses to a large number of small, noncontiguous
pieces of data. If an application's I/O needs are met by making many small,
distinct I/O requests, however, the I/O performance degrades drastically. To
avoid this problem, MPI-IO allows users to access noncontiguous data with a
single I/O function call, unlike in Unix I/O. In this paper, we explain how
critical this feature of MPI-IO is for high performance and how it enables
implementations to perform optimizations. We first provide a classification
of the different ways of expressing an application's I/O needs in MPI-IO-we
classify them into four levels, called level 0 through level 3. We
demonstrate that, for applications with noncontiguous access patterns, the
I/O performance improves dramatically if users write their applications to
make level-3 requests (noncontiguous, collective) rather than level-0
requests (Unix style). We then describe how our MPI-IO implementation, ROMIO,
delivers high performance for noncontiguous requests. We explain in detail
the two key optimizations ROMIO performs: data sieving for noncontiguous
requests from one process and collective I/O for noncontiguous requests from
multiple processes. We describe how we have implemented these optimizations
portably on multiple machines and file systems, controlled their memory
requirements, and also achieved high performance. We demonstrate the
performance and portability with performance results for three
applications-an astrophysics-application template (DIST3D), the NAS BTIO
benchmark, and an unstructured code (UNSTRUC)-on five different parallel
machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
Keywords: parallel I/O, parallel I/O, MPI-IO,
collective I/O, data sieving, pario-bib
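As a concrete illustration of a level-3 request (an editorial example, not
code from the paper): each process describes its block of a 2-D global array
with a subarray datatype, installs it as the file view, and calls a
collective write, so the whole distributed array reaches the file in a single
collective operation rather than many small independent requests.

    /* Level-3 (noncontiguous, collective) write with MPI-IO: each process
     * writes its block of a 2-D global array of doubles in one collective call. */
    #include <mpi.h>

    void write_block_distributed(const char *path, double *local,
                                 int gsizes[2], int lsizes[2], int starts[2])
    {
        MPI_Datatype filetype;
        MPI_File fh;

        /* Describe this process's piece of the global array. */
        MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_WORLD, (char *)path,   /* cast for old MPI-2 prototypes */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* Level 3: noncontiguous file view plus a collective call. */
        MPI_File_write_all(fh, local, lsizes[0] * lsizes[1], MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
    }

With a block distribution like this, collective I/O can typically service the
request with a few large, well-formed file accesses, whereas the same data
written with level-0 (independent, Unix-style) requests would generate one
small request per local row.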
Keywords: parallel I/O, pario-bib
Comment: Earlier version available as
NPAC/Syracuse tech report. They describe the design of an HPF compiler that
can translate out-of-core programs into plain programs with explicit I/O. For
the most part, they discuss many of the issues involved in manipulating the
arrys, and some of the alternatives for run-time support. The out-of-core
array is broken into pieces, one per processor. Each processor keeps its
local array piece in a file on its own logical disk, and reads and writes
pieces of that file as needed. Some of the tradeoffs appear to contrast the
amount of I/O with the ability to optimize communication: they choose a
method called ``out-of-core communication'' because it simplifies the
analysis of communication patterns, although it requires more I/O. The
compiler depends on run-time routines for support; the run-time routines hide
a lot of the architectural details, simplifying the job of the compiler and
making the resulting program more portable. There are some preliminary
performance numbers.
Abstract: In parallel programs with large
out-of-core arrays stored in files, it is necessary to read/write smaller
sections of the arrays from/to files. We describe a runtime method for
accessing sections of out-of-core arrays efficiently. This method, called the
extended two-phase method, uses collective I/O in which processors
cooperate to read/write out-of-core data in an efficient manner. The I/O
workload is divided among processors dynamically, depending on the access
requests. Performance results on the Intel Touchstone Delta show that the
extended two-phase method performs considerably better than a direct method
for different access patterns, array sizes, and number of processors. We have
used the extended two-phase method in the PASSION runtime library for
parallel I/O.
Keywords: parallel I/O, out-of-core, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: We are developing a compiler and runtime
support system called PASSION: Parallel And Scalable Software for
Input-Output. PASSION provides software support for I/O intensive out-of-core
loosely synchronous problems. This paper gives an overview of the PASSION
Runtime Library and describes two of the optimizations incorporated in it,
namely Data Prefetching and Data Sieving. Performance improvements provided
by these optimizations on the Intel Touchstone Delta are discussed, together
with an out-of-core Median Filtering application.
Keywords: parallel I/O, pario-bib
Comment: See thakur:jpassion. They describe the
PASSION library for parallel I/O, though the description is fairly
high-level. The main things that this paper adds to earlier papers from this
group is a discussion of Data Prefetching (which is really just an
asynchronous I/O interface that their compiler uses for prefetching) and Data
Sieving, which they use when the application needs to read some array section
that is not contiguous in the file; for example, a submatrix of a 2-d matrix
in a file stored row-major. Their solution is to read the complete set
of rows (or columns, depending on file layout) in one huge read, into a
memory buffer, and then extract the necessary data. Basically, this is
another form of the two-phase strategy.
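Concretely, a data-sieving read of an m x n submatrix from a row-major R x C
array of doubles stored in a Unix file looks roughly like the editorial
sketch below (not PASSION's code): one large read covering the rows touched,
followed by a copy loop that keeps only the wanted columns.

    /* Data-sieving read of submatrix [r0..r0+m) x [c0..c0+n) from a
     * row-major array with C columns of doubles stored in a Unix file. */
    #include <stdlib.h>
    #include <unistd.h>

    int sieve_read(int fd, int C, int r0, int c0, int m, int n, double *out)
    {
        /* One large contiguous read covering all the rows we touch. */
        size_t span = (size_t)m * C * sizeof(double);
        double *buf = malloc(span);
        if (!buf) return -1;
        off_t start = (off_t)r0 * C * sizeof(double);
        if (pread(fd, buf, span, start) != (ssize_t)span) { free(buf); return -1; }

        /* Extract only the columns we actually wanted. */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                out[i * n + j] = buf[i * C + (c0 + j)];

        free(buf);
        return 0;
    }

The price is the memory for the covering buffer and the extra data read; the
win is replacing m small strided reads with a single large one.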
We describe how our MPI-IO
implementation, ROMIO, delivers high performance in the presence of
noncontiguous requests. We explain in detail the two key optimizations ROMIO
performs: data sieving for noncontiguous requests from one process and
collective I/O for noncontiguous requests from multiple processes. We
describe how one can implement these optimizations portably on multiple
machines and file systems, control their memory requirements, and also
achieve high performance. We demonstrate the performance and portability with
performance results for three applications-an astrophysics-application
template (DIST3D), the NAS BTIO benchmark, and an unstructured code
(UNSTRUC)-on five different parallel machines: HP Exemplar, IBM SP, Intel
Paragon, NEC SX-4, and SGI Origin2000.
Abstract: The I/O access patterns of parallel
programs often consist of accesses to a large number of small, noncontiguous
pieces of data. If an application's I/O needs are met by making many small,
distinct I/O requests, however, the I/O performance degrades drastically. To
avoid this problem, MPI-IO allows users to access a noncontiguous data set
with a single I/O function call. This feature provides MPI-IO implementations
an opportunity to optimize data access.
Keywords: parallel I/O, collective I/O,
application programmer interface, pario-bib
Comment: They describe how ROMIO, their MPI-IO
implementation, delivers high performance through the use of data sieving and
collective I/O. The paper discusses several specific optimizations. They have
results from five major parallel platforms. The paper confirms that the UNIX
interface is terrible for many parallel access patterns, and that collective
I/O is an important solution.
Abstract: The I/O access patterns of parallel
programs often consist of accesses to a large number of small, noncontiguous
pieces of data. If an application's I/O needs are met by making many small,
distinct I/O requests, however, the I/O performance degrades drastically. To
avoid this problem, MPI-IO allows users to access a noncontiguous data set
with a single I/O function call. This feature provides MPI-IO implementations
an opportunity to optimize data access. We describe how our MPI-IO
implementation, ROMIO, delivers high performance in the presence of
noncontiguous requests. We explain in detail the two key optimizations ROMIO
performs: data sieving for noncontiguous requests from one process and
collective I/O for noncontiguous requests from multiple processes. We
describe how one can implement these optimizations portably on multiple
machines and file systems, control their memory requirements, and also
achieve high performance. We demonstrate the performance and portability with
performance results for three applications- an astrophysics-application
template (DIST3D), the NAS BTIO benchmark, and an unstructured code
(UNSTRUC)- on five different parallel machines: HP Exemplar, IBM SP, Intel
Paragon, NEC SX-4, and SGI Origin2000.
Keywords: parallel I/O, collective I/O,
application programmer interface, pario-bib
Abstract: ROMIO is a high-performance, portable
implementation of MPI-IO (the I/O chapter in MPI-2). This document describes
how to install and use ROMIO version 1.0.0 on various machines.
Keywords: file system interface, parallel I/O,
pario-bib
Abstract: This paper describes the design of a
compiler which can translate out-of-core programs written in a data parallel
language like HPF. Such a compiler is required for compiling large scale
scientific applications, such as the Grand Challenge applications, which deal
with enormous quantities of data. We propose a framework by which a compiler
together with appropriate runtime support can translate an out-of-core HPF
program to a message passing node program with explicit parallel I/O. We
describe the basic model of the compiler and the various transformations made
by the compiler. We also discuss the runtime routines used by the compiler
for I/O and communication. In order to minimize I/O, the runtime support
system can reuse data already fetched into memory. The working of the
compiler is illustrated using two out-of-core applications, namely a Laplace
equation solver and LU Decomposition, together with performance results on
the Intel Touchstone Delta.
Keywords: parallel I/O, pario-bib
Comment: They describe ways to make HPF handle
out-of-core arrays. Basically, they add directives to say which arrays are
out of core, and how much memory to devote to the in-core portion of the
array. Then the compiler distributes the array across processors, as in HPF,
to form local arrays. Each local array is broken into slabs, where each slab
can fit in local memory. The local array is kept in a local array file, from
which slabs are loaded and stored. Ghost nodes are also handled. They were
careful to avoid double I/O when one slab is another slab's ghost node. They
found it most convenient to do all the communication between iterations, then
do all the computation for that iteration, where the iteration itself
required a loop including both computation and I/O. This means that there may
need to be I/O during the communication phase, to store ghost nodes coming in
from other places. They do not mention use of asynchronous I/O for overlap.
See also bordawekar:efficient.
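The structure of the compiled node program amounts to a slab loop like the
editorial sketch below (an illustration of the scheme described above, not
the compiler's actual output): the local array piece lives in a per-processor
file, boundary data is exchanged first, and each iteration then streams slabs
through memory.

    /* Structure of an out-of-core node program: the local array lives in a
     * per-processor file and is processed one in-core slab at a time. */
    #include <stdio.h>
    #include <stdlib.h>

    void out_of_core_iteration(FILE *local_array_file, long local_rows,
                               long cols, long slab_rows)
    {
        double *slab = malloc((size_t)slab_rows * cols * sizeof(double));

        /* Communication phase would happen here: exchange ghost rows with
         * neighbors, possibly staging incoming ghost data to the file. */

        for (long r = 0; r < local_rows; r += slab_rows) {
            long rows = (r + slab_rows <= local_rows) ? slab_rows : local_rows - r;

            /* Load one slab from the local array file. */
            fseek(local_array_file, (long)(r * cols * sizeof(double)), SEEK_SET);
            fread(slab, sizeof(double), (size_t)(rows * cols), local_array_file);

            /* Compute on the in-core slab (stencil, relaxation, etc.). */

            /* Store the slab back. */
            fseek(local_array_file, (long)(r * cols * sizeof(double)), SEEK_SET);
            fwrite(slab, sizeof(double), (size_t)(rows * cols), local_array_file);
        }
        free(slab);
    }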
In distributed memory programs, it is often necessary to
change the distribution of arrays during program execution. This thesis
presents efficient and portable algorithms for runtime array redistribution.
The algorithms have been implemented on the Intel Touchstone Delta and are
found to scale well with the number of processors and array size. This thesis
also presents algorithms for all-to-all collective communication on fat-tree
and two-dimensional mesh interconnection topologies. The performance of these
algorithms on the CM-5 and Touchstone Delta is studied extensively. A model
for estimating the time taken by these algorithms on the basis of system
parameters is developed and validated by comparing with experimental results.
A number of applications deal with very large data sets which cannot fit
in main memory, and hence have to be stored in files on disks, resulting in
out-of-core programs. This thesis also describes the design and
implementation of efficient runtime support for out-of-core computations.
Several optimizations for accessing out-of-core data are presented. An
Extended Two-Phase Method is proposed for accessing sections of out-of-core
arrays efficiently. This method uses collective I/O and the I/O workload is
divided among processors dynamically, depending on the access requests.
Performance results obtained using this runtime support for out-of-core
programs on the Touchstone Delta are presented.
Abstract: Distributed memory parallel computers or
distributed computer systems are widely recognized as the only cost-effective
means of achieving teraflops performance in the near future. However, the
fact remains that they are difficult to program and advances in software for
these machines have not kept pace with advances in hardware. This thesis
addresses several issues in providing runtime support for in-core as well as
out-of-core programs on distributed memory parallel computers. This runtime
support can be directly used in application programs for greater efficiency,
portability and ease of programming. It can also be used together with a
compiler to translate programs written in a high-level data-parallel language
like High Performance Fortran (HPF) to node programs for distributed memory
machines.
Keywords: parallel I/O, runtime library, pario-bib
Keywords: parallel I/O, connection machine, disk
array, disk architecture, SIMD, pario-bib
Comment: I/O and Data Vault, pp. 27-30
Keywords: computer architecture, connection
machine, MIMD, SIMD, parallel I/O, pario-bib
Comment: Some detail but still skips over some key
aspects (like communication topology). Neat communications support makes for
user-mode message-passing, broadcasting, reductions, all built in. Lots of
info here. File system calls allow data to be transferred in parallel
directly from I/O node to processing node, bypassing the partition and I/O
management nodes. Multiple I/O devices (even DataVaults) can be logically
striped. See also best:cmmdio, loverso:sfs, think:cmmd, think:sda.
Keywords: parallel I/O, disk array, striping,
RAID, HIPPI, pario-bib
Comment: More detail about I/O nodes than
think:sda, including info about disk storage nodes, HIPPI nodes, and tape
nodes (ITS).
Keywords: MIMD, parallel programming, parallel
I/O, message-passing, pario-bib
Keywords: parallel I/O, disk array, striping,
RAID, pario-bib
Comment: Disk storage nodes (processor, network
interface, buffer, 4 SCSI controllers, 8 disks) attach individually to the
CM-5 network. The software stripes across all nodes in the system. Thus, the
collection of nodes is called a disk array. Multiple file systems across the
array. Flexible redundancy. RAID 3 is used, i.e., bit-striped and a single
parity disk. Remote access via NFS supported. Files stored in canonical
order, with special hardware to help distribute data across processors. See
best:cmmdio.
We have found that
the Galley File System provides a good environment on which to build
high-performance libraries, and that the mesh of Panda and Galley was a
successful combination.
Abstract: The Panda Array I/O library, created at
the University of Illinois, Urbana-Champaign, was built especially to address
the needs of high-performance scientific applications. I/O has been one of
the most frustrating bottlenecks to high performance for quite some time, and
the Panda project is an attempt to ameliorate this problem while still
providing the user with a simple, high-level interface. The Galley File
System, with its hierarchical structure of files and strided requests, is
another attempt at addressing the performance problem. My project was to
redesign the Panda Array library for use on the Galley file system. This
project involved porting Panda's three main functions: a checkpoint function
for writing a large array periodically for 'safekeeping,' a restart function
that would allow a checkpointed file to be read back in, and finally a
timestep function that would allow the user to write a group of large arrays
several times in a sequence. Panda supports several different distributions
in both the compute-node memories and I/O-node disks.
Keywords: multiprocessor file system, parallel
I/O, pario-bib
Comment: See seamons:thesis.
Abstract: Magnetic disks, which together with disk
arrays constitute a multibillion dollar industry, were developed in the
1950s. Disks were an advance over magnetic drums, which had a dedicated
read/write head per track, since much larger amounts of data could be
accessed in a cost-effective manner thanks to the sharability of the movable
read/write heads. DRAM memories, which are volatile, were projected to
replace disks a decade ago (see Section 2.4 in [33]). This did not
materialize, partly because of the inherent volatility of DRAM, i.e., a power
source is required to ensure that DRAM contents are not lost, but also
because of recent dramatic increases in areal recording density and hence
disk capacity, which is estimated at a 60% compound annual growth rate
(CAGR). This has resulted in a rapid decrease in the cost per megabyte of
disk capacity, which is now lower than that of DRAM by a factor of 1000.
Keywords: data allocation, scheduling, disk
arrays, pario-bib
Abstract: Modern scientific computing involves
organizing, moving, visualizing, and analyzing massive amounts of data at
multiple sites around the world. The technologies, the middleware services,
and the architectures that are used to build useful high-speed, wide area
distributed systems, constitute the field of data intensive computing. In
this paper we will describe an architecture for data intensive applications
where we use a high-speed distributed data cache as a common element for all
of the sources and sinks of data. This cache-based approach provides standard
interfaces to a large, application-oriented, distributed, on-line, transient
storage system. We describe our implementation of this cache, how we have
made it "network aware," and how we do dynamic load balancing based on the
current network conditions. We also show large increases in application
throughput by access to knowledge of the network conditions.
Keywords: distributed cache, distributed
computing, grid, input/output, network-aware, parallel I/O, pario-bib
Comment: They discuss their implementation of a
"network aware" data cache (Distributed Parallel Storage System) that adapts
to changing network conditions. The system itself looks much like the Galley
File System. The client library is multi-threaded with a client thread for
each DPSS server. A DPSS server is composed of a block request thread, a
block writer thread, a shared disk cache, and a reader thread for each disk.
Block requests move into the shared cache from the disks. A DPSS master
directs the clients' requests to an appropriate DPSS server. They use Java
agents to monitor network performance and use data replication for load
balancing. A minimum cost flow algorithm is run each time a client request
arrives to determine the best place to retrieve the data block. They argue
that since the algorithm is fast (< 1 ms), the overhead of the algorithm is
not significant.
Keywords: parallel I/O, file system interface,
multiprocessor file system, pario-bib
Comment: Have two types of files, parallel and
serial, differing in the way data is laid out internally. Also have three
modes for reading the file: synchronous, streaming (asynchronous), and
buffered.
Abstract: We report here on a project that expands
the applicability of dynamic climate modeling to very long time scales. The
Fast Ocean-Atmosphere Model (FOAM) is a coupled ocean-atmosphere model that
incorporates physics of interest in understanding decade to century time
scale variability. It addresses the high computational cost of this endeavor
with a combination of improved ocean model formulation, low atmosphere
resolution, and efficient coupling. It also uses message-passing parallel
processing techniques, allowing for the use of cost-effective distributed
memory platforms. The resulting model runs over 6000 times faster than real
time with good fidelity and has yielded significant results.
Keywords: parallel I/O, scientific application,
pario-bib
Comment: This paper is about the Fast
Ocean-Atmosphere Model (FOAM), a climate model that uses ``a combination of
new model formulation and parallel computing to expand the time horizon that
may be addressed by explicit fluid dynamical representations of the climate
system.'' Their model uses message passing on massively parallel
distributed-memory computer systems. They are in the process of investigating
using parallel I/O to further increase their efficiency.
Abstract: SOLAR is a portable high-performance
library for out-of-core dense matrix computations. It combines portability
with high performance by using existing high-performance in-core subroutine
libraries and by using an optimized matrix input-output library. SOLAR works
on parallel computers, workstations, and personal computers. It supports
in-core computations on both shared-memory and distributed-memory machines,
and its matrix input-output library supports both conventional I/O interfaces
and parallel I/O interfaces. This paper discusses the overall design of
SOLAR, its interfaces, and the design of several important subroutines.
Experimental results show that SOLAR can factor on a single workstation an
out-of-core positive-definite symmetric matrix at a rate exceeding 215
Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops.
Less than 16% of the running time is spent on I/O in these computations.
These results indicate that SOLAR's portability does not compromise its
performance. We expect that the combination of portability, modularity, and
the use of a high-level I/O interface will make the library an important
platform for research on out-of-core algorithms and on parallel I/O.
Keywords: parallel I/O, out-of-core, linear
algebra, pario-bib
Comment: Sounds great. Library package that
supports LAPACK-like functionality on in-core and out-of-core matrices. Good
performance. Good portability (IBM workstation, IBM SP-2, and OS/2 laptop).
They separate the matrix algorithms from the underlying I/O routines in an
interesting way (read and write submatrices), leaving just enough information
to allow the I/O system to do some higher-level optimizations.
Keywords: out-of-core algorithm, survey, numerical
analysis, linear algebra, pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Keywords: pario-bib
Abstract: Dataset storage, exchange, and access
play a critical role in scientific applications. For such purposes netCDF
serves as a portable, efficient file format and programming interface, which
is popular in numerous scientific application domains. However, the original
interface does not provide an efficient mechanism for parallel data storage
and access. In this work, we present a new parallel interface for
writing and reading netCDF datasets. This interface is derived with minimal
changes from the serial netCDF interface but defines semantics for parallel
access and is tailored for high performance. The underlying parallel I/O is
achieved through MPI-IO, allowing for substantial performance gains through
the use of collective I/O optimizations. We compare the implementation
strategies and performance with HDF5. Our tests indicate programming
convenience and significant I/O performance improvement with this parallel
netCDF (PnetCDF) interface.
Keywords: parallel I/O interface, netCDF, MPI-IO,
pario-bib
Comment: published on the web only
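To make the interface style concrete, here is a minimal C sketch of a
collective PnetCDF write, assuming a hypothetical file name, variable name,
and row-block decomposition (with a rank count that divides the dimension
evenly); it illustrates the API described above and is not code from the
paper.

    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    /* Hypothetical example: each rank writes one row block of a global
       GLOBAL_NX x GLOBAL_NY array using PnetCDF collective I/O.
       Error checking is omitted for brevity. */
    #define GLOBAL_NX 1024
    #define GLOBAL_NY 1024

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimids[2], varid;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Collective create; all ranks share one file ("output.nc" is
           an invented name). */
        ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", GLOBAL_NX, &dimids[0]);
        ncmpi_def_dim(ncid, "y", GLOBAL_NY, &dimids[1]);
        ncmpi_def_var(ncid, "temperature", NC_DOUBLE, 2, dimids, &varid);
        ncmpi_enddef(ncid);

        /* Each rank owns a contiguous block of rows (assumes GLOBAL_NX
           is divisible by nprocs). */
        MPI_Offset start[2] = { rank * (GLOBAL_NX / nprocs), 0 };
        MPI_Offset count[2] = { GLOBAL_NX / nprocs, GLOBAL_NY };
        double *local = malloc(count[0] * count[1] * sizeof(double));
        /* ... fill local data ... */

        /* Collective write: lets the MPI-IO layer merge requests. */
        ncmpi_put_vara_double_all(ncid, varid, start, count, local);

        ncmpi_close(ncid);
        free(local);
        MPI_Finalize();
        return 0;
    }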
Keywords: parallel processing, parallel I/O,
pario-bib
Comment: Difficult to follow since it is missing
its figures. ``Our most important result is that multiprocessor systems can
benefit considerably more than single processor systems with the introduction
of CPU:I/O overlap.'' They overlap I/O needed by some future CPU sequence
with the current CPU operation. They claim it looks good for large numbers of
processors. Their orientation seems to be for multiprocessors operating on
independent tasks.
Keywords: parallel processing, parallel I/O,
pario-bib
Comment: Models CPU:I/O and I/O:I/O overlap within
a program. ``Overlapping is helpful only when it allows a device to be
utilized which would not be utilized without overlapping.'' In general the
overlapping seems to help.
Keywords: parallel I/O, sparse matrix, pario-bib
Comment: They discuss a library to support
irregular data structures, really sparse matrices, on distributed-memory
machines. Their library supports several in-memory and out-of-core data
distributions, and routines to read and write matrices in those
distributions. The paper is sketchy and poorly written. There is little
material on I/O.
Abstract: Bursty application I/O patterns,
together with transfer limited storage devices, combine to create a major I/O
bottleneck on parallel systems. This paper explores the use of time series
models to forecast application I/O request times, then prefetching I/O
requests during computation intervals to hide I/O latency. Experimental
results with I/O intensive scientific codes show performance improvements
compared to standard UNIX prefetching strategies.
Keywords: pario-bib, access pattern, prefetching,
modeling, time-series analysis
Abstract: Inadequate I/O performance remains a
major challenge in using high-end computing systems effectively. To address
this problem, the paper presents TsModeler, an automatic time series modeling
and prediction framework for adaptive I/O prefetching that uses ARIMA time
series models to predict the temporal patterns of I/O requests. These online
pattern analysis techniques and cutoff indicators for autocorrelation
patterns enable multistep online predictions suitable for multiblock
prefetching. This work also combines time series predictions with spatial
Markov model predictions to determine when, what, and how many blocks to
prefetch. Experimental results show reductions in execution time compared to
the standard Linux file system across various hardware configurations.
Keywords: pario-bib, access pattern, prefetching,
modeling, time-series analysis
Abstract: Disk array systems are rapidly becoming
the secondary-storage media of choice for many emerging applications with
large storage and high bandwidth requirements. Striping data across the disks
of a disk array introduces significant performance benefits mainly because
the effective transfer rate of the secondary storage is increased by a factor
equal to the stripe width. However, the choice of the optimal stripe width is
an open problem: no general formal analysis has been reported and intuition
alone fails to provide good guidelines. As a result one may find occasionally
contradictory recommendations in the literature. With this work we first
contribute an analytical calculation of the optimal stripe width. Second, we
recognize that the optimal stripe width is sensitive to the multiprogramming
level, which is not known a priori and fluctuates with time. Thus,
calculations of the optimal stripe width are, by themselves, of little
practical use. For this reason we propose a novel striping technique, called
overlay striping, which allows objects to be retrieved using a number of
alternative stripe widths. We provide the detailed algorithms for our overlay
striping method, study the associated storage overhead and performance
improvements, and show that we can achieve near-optimal performance for
very wide ranges of the possible multiprogramming levels, while incurring
small storage overheads.
Keywords: parallel I/O, striping, pario-bib
Comment: Part of a special issue.
Keywords: parallel file system, pario-bib
Comment: A description of their new parallel file
system for the AP-1000. Conceptually, not much new here.
Abstract: This paper considers the performance of
cached RAID5 using simulations that are driven by database I/O traces
collected at customer sites. This is in contrast to previous performance
studies using analytical modelling or random-number simulations. We studied
issues of cache size, disk buffering, cache replacement policies, cache
allocation policies, destage policies and striping. Our results indicate
that: read caching has considerable value; a small amount of cache should be
used for writes; fast write logic can reduce disk utilization for writes by an
order of magnitude; priority queueing should be supported at the disks; disk
buffering prefetch should be used; for large caches, it pays to cache
sequentially accessed blocks; RAID5 with cylinder striping is superior to
parity striping.
Keywords: parallel I/O, RAID, disk array,
pario-bib
Abstract: A flexible intermediate library named
Stampi realizes seamless MPI operations on interconnected parallel computers.
Dynamic process creation and MPI-I/O operations both inside a computer and
among computers are available with it. MPI-I/O operations to a remote
computer are realized by MPI-I/O processes of the Stampi library which are
invoked on a remote computer using a vendor-supplied MPI-I/O library. If the
vendor-supplied one is not available, a single MPI-I/O process is invoked on
a remote computer, and it uses UNIX I/O functions instead of the
vendor-supplied one. In nonblocking MPI-I/O functions with multiple user
processes, the single MPI-I/O process carries out I/O operations required by
the processes sequentially. This results in little overlap of computation by
the user processes with I/O operations by the MPI-I/O process. Therefore the
performance of the nonblocking functions is poor with multiple user
processes. To realize effective I/O operations, a Pthreads library has been
implemented in the MPI-I/O mechanism, and multi-threaded I/O operations have
been realized. The newly implemented MPI-I/O mechanism has been evaluated on
inter-connected PC clusters, and higher overlap of the computation with the
I/O operations has been achieved.
Keywords: stampi, MPI-I/O, dynamic process
creation, multithreaded, overlap computation and I/O, pario-bib
Comment: also see tsujita:stampi*.
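The computation/I/O overlap that this work targets can be pictured with plain
(non-Stampi) nonblocking MPI-IO calls; the C fragment below is a generic
sketch with an invented file layout and no error handling, not the Stampi
implementation.

    #include <mpi.h>

    /* Schematic: start a nonblocking write, compute while the I/O
       progresses, then wait for completion.  Offsets and sizes are
       illustrative only. */
    void write_and_compute(const char *path, double *buf, int n, int rank)
    {
        MPI_File fh;
        MPI_Request req;
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);

        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Nonblocking write of this rank's block. */
        MPI_File_iwrite_at(fh, offset, buf, n, MPI_DOUBLE, &req);

        /* ... computation that does not touch buf overlaps the write ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the I/O */
        MPI_File_close(&fh);
    }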
Abstract: An MPI-2 based parallel-I/O library,
Stampi-I/O, has been developed using flexible communication infrastructure.
In Stampi-I/O almost all MPI-I/O functions have been implemented. We can
execute these functions using both local and remote I/O operations with the
same application program interface (API) based on MPI-2. In I/O operations
using Stampi-I/O, users need not handle any differences in the communication
mechanism of computers. We have evaluated performance for primitive functions
in Stampi-I/O. Through this test, sufficient performance has been achieved
and the effectiveness of our flexible implementation has been confirmed.
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Abstract: A flexible intermediate library named
Stampi realizes seamless MPI operations on a heterogeneous computing
environment. With the help of a flexible communication mechanism of this
library, users can execute MPI functions without awareness of underlying
communication mechanism. Although Stampi supports MPI-I/O among different
platforms, UNIX I/O functions are used when a vendor-supplied MPI-I/O library
is not available. To realize distributed I/O operations, a parallel virtual
file system (PVFS) has been implemented in the MPI-I/O mechanism. Primitive
MPI-I/O functions of Stampi have been evaluated and sufficient performance
has been achieved.
Keywords: MPI-IO, PVFS, remote I/O, grid,
pario-bib
Comment: also see tsujita:stampi.
Abstract: The steady increase of computing power
at lower and lower cost enables molecular dynamics simulations to investigate
the process of protein folding with an explicit treatment of water molecules.
Such simulations are typically done with well known computational chemistry
codes like CHARMM. Desktop grids such as the United Devices MetaProcessor are
highly attractive platforms, since scavenging for unused machines on
intranets and the Internet delivers compute power that is almost free. However, the
predominant programming paradigm for current desktop grids is pure task
parallelism and might not fit the needs of protein folding simulations with
explicit water molecules. A short overall turn-around time of a simulation
remains highly important for research productivity, but the need for an
accurate model and long simulation time-scales leads to tasks that are too
large for optimal scheduling on a desktop grid. To address this problem, we
introduce a combination of task and data parallelism as a well-suited
computing paradigm for protein folding investigations on grid platforms. As a
proof of concept, we design and implement a simple system for protein folding
simulations based on the notion of combined task and data parallelism with
clustered workers. Clustered workers are machines grouped into small clusters
according to network and CPU performance criteria and act as super-nodes
within a desktop grid, permitting the utilization of data parallelism in
addition to the task parallelism. We integrate our new paradigm into the
existing software environment of the United Devices MetaProcessor. For a test
protein, we reach a better quality of the folding calculations than we
reached using just task parallelism on distributed systems.
Keywords: protein folding, grid application,
parallel I/O, pario-app, pario-bib
Abstract: Current disk arrays, the basic building
blocks of high-performance storage systems, are built around two memory
technologies: magnetic disk drives, and non-volatile DRAM caches. Disk
latencies are higher by six orders of magnitude than non-volatile DRAM access
times, but cache costs over 1000 times more per byte. A new storage
technology based on microelectromechanical systems (MEMS) will soon offer a
new set of performance and cost characteristics that bridge the gap between
disk drives and the caches. We evaluate potential gains in performance and
cost by incorporating MEMS-based storage in disk arrays. Our evaluation is
based on exploring potential placements of MEMS-based storage in a disk
array. We used detailed disk array simulators to replay I/O traces of real
applications for the evaluation. We show that replacing disks with MEMS-based
storage can improve the array performance dramatically, with a
cost/performance ratio several times better than conventional arrays even if MEMS
storage costs ten times as much as disk. We also demonstrate that hybrid
MEMS/disk arrays, which cost less than purely MEMS-based arrays, can provide
substantial improvements in performance and cost/performance over
conventional arrays.
Keywords: mems-based storage, disk arrays,
pario-bib
Comment: Best paper in fast2003.
Keywords: multimedia, distributed file system,
disk striping, pario-bib
Comment: A DFS for multimedia. Expect large files,
read-mostly, highly sequential. Temporal synchronization is key. An
administration server handles opens and closes, and provides guarantees on
performance (like Swift). The interface at the client nodes talks to the
admin server transparently, and stripes requests over all storage nodes.
Storage nodes may internally use RAIDs, I suppose. Files are a series of
frames, rather than bytes. Each frame has a time offset in seconds. Seeks can
be by frame number or time offset. File containers contain several files, and
have attributes that specify performance requirements. Interface does
prefetching, based on read direction (forward or backward) and any frame
skips. But frames are not transmitted from storage server to client node
until requested (client pacing). Claim that synchronous disk interleaving
with a striping unit of one frame is best. Could get 30 frames/sec (3.5MB/s)
with 2 DECstation 5000s and 4 disks, serving a client DEC 5000.
Keywords: unix, multiprocessor file system,
pario-bib
Comment: How to split up the internals of the Unix
I/O system to run on a shared-memory multiprocessor in a non-symmetric OS.
They decided to split the functionality just above the buffer cache level,
putting the buffer cache management and device drivers on the special I/O
processors.
Keywords: parallel I/O, pario-bib
Comment: Using a hardware monitor they measure the
I/O-bus usage on a 4-processor Sun workstation. They characterize the bus
contention caused by multiple different devices (disk, screen, and network).
The contention sometimes caused significant performance degradation (to the
end-user) despite the bus not being overloaded.
Abstract: None.
Keywords: parallel I/O, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: The performance modeling and analysis of
disk arrays is challenging due to the presence of multiple disks, large array
caches, and sophisticated array controllers. Moreover, storage manufacturers
may not reveal the internal algorithms implemented in their devices, so real
disk arrays are effectively black-boxes. We use standard performance
techniques to develop an integrated performance model that incorporates some
of the complexities of real disk arrays. We show how measurement data and
baseline performance models can be used to extract information about the
various features implemented in a disk array. In this process, we identify
areas for future research in the performance analysis of real disk arrays.
Keywords: performance analysis, disk arrays,
performance modeling, pario-bib
Keywords: parallel I/O, disk array, RAID, disk
caching, pario-bib
Comment: Part of jin:io-book; reformatted version
of varma:destage.
Abstract: In a disk array with a nonvolatile write
cache, destages from the cache to the disk are performed in the background
asynchronously while read requests from the host system are serviced in the
foreground. In this paper, we study a number of algorithms for scheduling
destages in a RAID-5 system. We introduce a new scheduling algorithm, called
linear threshold scheduling, that adaptively varies the rate of destages to
disks based on the instantaneous occupancy of the write cache. The
performance of the algorithm is compared with that of a number of alternative
scheduling approaches, such as least-cost scheduling and high/low mark. The
algorithms are evaluated in terms of their effectiveness in making destages
transparent to the servicing of read requests from the host, disk
utilization, and their ability to tolerate bursts in the workload without
causing an overflow of the write cache. Our results show that linear
threshold scheduling provides the best read performance of all the algorithms
compared, while still maintaining a high degree of burst tolerance. An
approximate implementation of the linear-threshold scheduling algorithm is
also described. The approximate algorithm can be implemented with much lower
overhead, yet its performance is virtually identical to that of the ideal
algorithm.
Keywords: parallel I/O, disk array, RAID, disk
caching, pario-bib
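A minimal sketch of the linear-threshold idea as described in the abstract,
with invented watermark parameters; the paper's actual algorithm and
constants may differ.

    /* Illustrative only: map write-cache occupancy (0.0-1.0) to a destage
       rate, interpolating linearly between low and high watermarks. */
    double destage_rate(double occupancy, double low_mark, double high_mark,
                        double max_rate)
    {
        if (occupancy <= low_mark)
            return 0.0;          /* cache nearly empty: stay idle        */
        if (occupancy >= high_mark)
            return max_rate;     /* near overflow: destage at full rate  */
        return max_rate * (occupancy - low_mark) / (high_mark - low_mark);
    }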
Abstract: The I/O performance of applications in
multiple-disk systems can be improved by overlapping disk accesses. This
requires the use of appropriate prefetching and buffer management algorithms
that ensure the most useful blocks are accessed and retained in the buffer.
In this paper, we answer several fundamental questions on prefetching and
buffer management for distributed-buffer parallel I/O systems. First, we
derive and prove the optimality of an algorithm, P-min, that minimizes the
number of parallel I/Os. Second, we analyze P-con, an algorithm that always
matches its replacement decisions with those of the well-known demand-paged
MIN algorithm. We show that P-con can become fully sequential in the worst
case. Third, we investigate the behavior of on-line algorithms for
multiple-disk prefetching and buffer management. We define and analyze P-LRU,
a parallel version of the traditional LRU buffer management algorithm.
Unexpectedly, we find that the competitive ratio of P-LRU is independent of
the number of disks. Finally, we present the practical performance of these
algorithms on randomly generated reference strings. These results confirm the
conclusions derived from the analysis on worst case inputs.
Keywords: parallel I/O, prefetching, pario-bib
Keywords: file prefetching, cost-benefit analysis,
parallel I/O, pario-bib
Comment: They describe a prefetching scheme which
prefetches blocks using a cost-benefit analysis scheme based on the
probability that the block will be accessed. The benefit of prefetching a
block is compared to the cost of replacing another block from the cache. They
were able to reduce cache miss rates by 36% for workloads which receive no
benefit from sequential prefetching.
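The cost-benefit test might be sketched as follows; the structure, names, and
cost model here are assumptions for illustration, not the authors' code.

    /* Illustrative cost-benefit test: prefetch block b only if the expected
       saving from hitting on b exceeds the expected cost of evicting the
       least valuable cached block (the victim). */
    struct block { double p_access; /* predicted probability of access */ };

    int should_prefetch(const struct block *b, const struct block *victim,
                        double fetch_cost, double miss_penalty)
    {
        double benefit = b->p_access * miss_penalty;          /* saved miss */
        double cost = fetch_cost + victim->p_access * miss_penalty; /* lost hit */
        return benefit > cost;
    }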
Keywords: parallel I/O, parallel I/O algorithms,
pario-bib
Comment: Interesting interface, providing
high-level data-parallel access to vectors of data on disk. Implementation
expectation is to use raw disk devices. Goals: abstraction, support for
algorithmic optimality, flexible, portable, and extensible. TPIE is a set of
C++ templates and libraries, where the user supplies callback functions to
TPIE access methods. TPIE contains a small variety of access methods, each of
which operates on a set of input and output streams, calling the user's
function once for each set of input records. They can do scan, merge,
distribution, sort, permute, batch filter, and distribution-sweep. There is a
single thread of control (at least conceptually). Their first prototype is on
a Sun SPARCstation; later, clusters of workstations and then a
multiprocessor. See vengroff:efficient, vengroff:tpie-man.
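The callback-driven access-method style described above can be pictured with
the following sketch; TPIE itself is a C++ template library, so this is only
a schematic C analogue with invented names.

    #include <stdio.h>

    /* Schematic analogue of a "scan" access method: the framework streams
       records past a user-supplied callback; names are invented. */
    typedef struct { double value; } record_t;
    typedef int (*scan_fn)(const record_t *in, record_t *out, void *state);

    long scan_stream(FILE *in, FILE *out, scan_fn fn, void *state)
    {
        record_t r, w;
        long n = 0;
        while (fread(&r, sizeof r, 1, in) == 1) {
            if (fn(&r, &w, state))              /* callback decides output */
                fwrite(&w, sizeof w, 1, out);
            n++;
        }
        return n;                               /* records scanned */
    }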
Keywords: parallel I/O, algorithm, run-time
library, pario-bib
Comment: Shorter version of vengroff:efficient2.
Excellent paper. This paper does not describe TPIE itself very much, but more
about a set of benchmarks using TPIE. All of the benchmarks are run on one
disk and one processor. TPIE can use multiple disks and one processor, with
plans to extend it to multiple processors later. See vengroff:tpie and
vengroff:efficient-tr. Same as vengroff:efficient2?
Keywords: parallel I/O algorithm, scientific
computing, runtime library, pario-bib
Comment: Expanded version of vengroff:efficient.
Keywords: parallel I/O algorithms, run-time
support, parallel I/O, multiprocessor file system interface, pario-bib
Comment: Longer version of vengroff:efficient.
Keywords: parallel I/O algorithm, pario-bib
Keywords: parallel I/O, parallel I/O algorithm,
file system interface, pario-bib
Comment: Currently an alpha version. It is in the
process of being updated. The most current version is generally available on
the WWW. See vengroff:tpie, vengroff:efficient.
Abstract: Performance of I/O intensive
applications on a multiprocessor system depends mostly on the variety of disk
access delays encountered in the I/O system. Over the years, the improvement
in disk performance has taken place more slowly than the corresponding
increase in processor speeds. It is therefore necessary to model I/O delays
and evaluate performance benefits of moving an application to a better
multiprocessor system. We perform such an analysis by measuring I/O delays
for a synthesized application that uses a parallel distributed file system.
The aim of this study is to evaluate the performance benefits of better disks
in a multiprocessor system. We report on how the I/O performance would be
affected if an application were to run on a system which would have better
disks and communication links. In this study, we show a substantial
improvement in the performance of an I/O system with better disks and
communication links with respect to the existing system.
Keywords: parallel I/O, pario-bib
Abstract: The computer scientists of EPFL (Prof.
R.D. Hersch and his staff), in collaboration with the Geneva Hospitals and
WDS Technologies SA, have developed a parallel image server to extract image
slices of the Visible Human from any orientation. This 3D dataset originates
from a prisoner sentenced to death who offered his body to science. The dead
body was frozen and then cut and digitized into 1 mm horizontally spaced
slices by the National Library of Medicine, Bethesda, Maryland, and the
University of Colorado, USA. The total volume of all slices represents a size
of 13 Gbyte of data.
Keywords: image processing, parallel I/O,
pario-bib
Comment: Very cool. See also gennart:CAP,
messerli:tomographic, messerli:jimage, messerli:thesis.
Abstract: I/O bottlenecks are already a problem in
many large-scale applications that manipulate huge datasets. This problem is
expected to get worse as applications get larger, and the I/O subsystem
performance lags behind processor and memory speed improvements. Caching I/O
blocks is one effective way of alleviating disk latencies, and there can be
multiple levels of caching on a cluster of workstations. Previous studies
have shown the benefits of caching - whether it be local to a particular
node, or a shared global cache across the cluster - for certain applications.
However, we show that while caching is useful in some situations, it can hurt
performance if we are not careful about what to cache and when to bypass the
cache. This paper presents compilation techniques and runtime support to
address this problem. These techniques are implemented and evaluated on an
experimental Linux/Pentium cluster running a parallel file system. Our
results using a diverse set of applications (scientific and commercial)
demonstrate the benefits of a discretionary approach to caching for I/O
subsystems on clusters, providing as much as 33% savings over
indiscriminately caching everything in some applications.
Keywords: caching, parallel I/O, pario-bib
Abstract: The ever-increasing gap in performance
between CPU/memory technologies and the I/O subsystem (disks, I/O buses) in
modern workstations has exacerbated the I/O bottlenecks inherent in
applications that access large disk resident data sets. A common technique to
alleviate the I/O bottlenecks on clusters of workstations, is the use of
parallel file systems. One such parallel file system is the Parallel Virtual
File System (PVFS), which is a freely available tool to achieve
high-performance I/O on Linux-based clusters. In this paper we describe the
performance and scalability of the UNIX I/O interface to PVFS. To illustrate
the performance, we present experimental results using Bonnie++, a commonly
used file system benchmark to test file system throughput; a synthetic
parallel I/O application for calculating aggregate read and write bandwidths;
and a synthetic benchmark which calculates the time taken to untar the Linux
kernel source tree to measure performance of a large number of small file
operations. We obtained aggregate read and write bandwidths as high as 550
MB/s with a Myrinet-based network and 160 MB/s with fast Ethernet.
Keywords: posix I/O interface, performance, PVFS,
parallel file system, pario-bib
Keywords: prefetching, data compression, pario-bib
Keywords: parallel I/O algorithms, parallel
memory, pario-bib
Comment: Summary of vitter:parmem1 and
vitter:parmem2.
Abstract: We provide the first optimal algorithms
in terms of the number of input/outputs (I/Os) required between internal
memory and multiple secondary storage devices for the problems of sorting,
FFT, matrix transposition, standard matrix multiplication, and related
problems. Our two-level memory model is new and gives a realistic treatment
of parallel block transfer, in which during a single I/O each of the
$P$ secondary storage devices can simultaneously transfer a contiguous block
of $B$ records. The model pertains to a large-scale uniprocessor system or
parallel multiprocessor system with $P$ disks. In addition, the sorting, FFT,
permutation network, and standard matrix multiplication algorithms are
typically optimal in terms of the amount of internal processing time. The
difficulty in developing optimal algorithms is to cope with the partitioning
of memory into $P$ separate physical devices. Our algorithms' performance can
be significantly better than those obtained by the well-known but nonoptimal
technique of disk striping. Our optimal sorting algorithm is randomized, but
practical; the probability of using more than $\ell$ times the optimal number
of I/Os is exponentially small in $\ell (\log \ell) \log (M/B)$, where $M$ is
the internal memory size.
Keywords: parallel I/O algorithms, parallel
memory, pario-bib
Comment: See shorter version vitter:optimal. See
TR version vitter:parmem1-tr. See also vitter:parmem2.
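For context (a standard result for this model, stated here rather than quoted
from the abstract): the optimal number of parallel I/Os for sorting is
$\Theta\left(\frac{N}{PB}\log_{M/B}\frac{N}{B}\right)$, where $N$ is the
number of records being sorted and $P$, $B$, and $M$ are the number of
devices, the block size, and the internal memory size as in the abstract.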
Keywords: parallel I/O algorithms, parallel
memory, pario-bib
Comment: Summarized in vitter:optimal. Published
as vitter:parmem1.
Abstract: In this paper we introduce parallel
versions of two hierarchical memory models and give optimal algorithms in
these models for sorting, FFT, and matrix multiplication. In our parallel
models, there are $P$ memory hierarchies operating simultaneously;
communication among the hierarchies takes place at a base memory level. Our
optimal sorting algorithm is randomized and is based upon the probabilistic
partitioning technique developed in the companion paper for optimal disk
sorting in a two-level memory with parallel block transfer. The probability
of using $\ell$ times the optimal running time is exponentially small
in $\ell (\log \ell) \log P$.
Keywords: parallel I/O algorithms, parallel
memory, pario-bib
Comment: Summarized in vitter:optimal.
Keywords: parallel I/O algorithms, parallel
memory, pario-bib
Comment: Summarized in vitter:optimal.
Abstract: Caching and prefetching are important
mechanisms for speeding up access time to data on secondary storage. Recent
work in competitive online algorithms has uncovered several promising new
algorithms for caching. In this paper, we apply a form of the competitive
philosophy for the first time to the problem of prefetching to develop an
optimal universal prefetcher in terms of fault ratio, with particular
applications to large-scale databases and hypertext systems. Our algorithms
for prefetching are novel in that they are based on data compression
techniques that are both theoretically optimal and good in practice.
Intuitively, in order to compress data effectively, you have to be able to
predict future data well, and thus good data compressors should be able to
predict well for purposes of prefetching. We show for powerful models such as
Markov sources and $m$th order Markov sources that the page fault rates
incurred by our prefetching algorithms are optimal in the limit for almost
all sequences of page accesses.
Keywords: parallel I/O algorithms, disk
prefetching, pario-bib
Comment: ``This... is on prefetching, but I think
the ideas will have a lot of use with parallel disks. The implementations we
have now are doing amazingly well compared to LRU.'' [Vitter]. See
vitter:jprefetch.
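A much-simplified stand-in for these predictors is a first-order Markov table
over page identifiers; the sketch below is illustrative only and far weaker
than the compression-based predictors in the paper, and the table size is an
arbitrary example.

    #define NPAGES 1024   /* illustrative page-id space */

    /* First-order Markov predictor: count page-to-page transitions and
       prefetch the most frequent successor of the page just accessed. */
    static unsigned counts[NPAGES][NPAGES];

    void observe(int prev_page, int page)
    {
        counts[prev_page][page]++;
    }

    int predict_next(int page)
    {
        int best = -1;
        unsigned best_count = 0;
        for (int next = 0; next < NPAGES; next++) {
            if (counts[page][next] > best_count) {
                best_count = counts[page][next];
                best = next;        /* candidate block to prefetch */
            }
        }
        return best;                /* -1 if page has never been seen */
    }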
Keywords: parallel I/O algorithms, sorting,
pario-bib
Comment: Good overview of all the other papers.
Keywords: out-of-core algorithm, pario-bib
Comment: Earlier shorter versions entitled
"External Memory Algorithms" appear as an invited tutorial in Proceedings of
the 17th ACM Symposium on Principles of Database Systems, Seattle, WA, June
1998, 119-128, and as an invited paper in Proceedings of the 6th Annual
European Symposium on Algorithms, Venice, Italy, August 1998, 1-25,
published in Lecture Notes in Computer Science, 1461, Springer-Verlag, Berlin.
Keywords: parallel I/O algorithm, sorting,
pario-bib
Comment: Summary is nodine:sort.
Keywords: disk striping, pario-bib
Comment: Describes the VAX disk striping driver.
Stripes an apparently arbitrary number of disk devices. All devices must be
the same type, and apparently completely used. Manager can specify
``chunksize'', the number of logical blocks per striped block. They suggest
using the track size of the device as the chunk size. They also point out
that multiple controllers should be used in order to gain parallelism.
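The block-to-disk arithmetic implied by such a striping driver can be
sketched as follows (invented names, not the VAX driver's code), with
chunksize the number of logical blocks per striped block as in the comment.

    /* Map a logical block number to (disk index, block offset on that disk)
       for simple round-robin striping with a given chunksize. */
    struct location { int disk; long block; };

    struct location stripe_map(long logical_block, int ndisks, long chunksize)
    {
        long chunk = logical_block / chunksize;     /* which chunk overall   */
        long within = logical_block % chunksize;    /* offset inside chunk   */
        struct location loc;
        loc.disk  = (int)(chunk % ndisks);          /* round-robin over disks */
        loc.block = (chunk / ndisks) * chunksize + within;
        return loc;
    }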
Keywords: distributed shared memory, cooperative
caching, parallel I/O, pario-bib
Abstract: This paper is concerned with efficient
execution of applications that are composed of a chain of sequential data
processes, which exchange data through a file system. We focus on
pipeline-shared I/O behavior within a single pipeline of processes running on
a cluster. We examine several scheduling strategies and experimentally
evaluate them for efficient use of the Parallel Virtual File System (PVFS) as
a common storage pool.
Keywords: PVFS, pipelined-shared I/O, grid
computing, pario-bib
Abstract: Massively parallel applications must
address problems that will be too large for workstations for the next several
years, or else it will not make sense to expend development costs on them.
Suitable applications include one or more of the following properties: 1)
large amounts of data; 2) intensive computations; 3) requirement for very
fast response times; 4) ways to trade computations for human effort, as in
developing applications using learning methods. Most of the suitable
applications that we have found come from the general area of very large
databases. Massively parallel machines have proved to be important not only
in being able to run large applications, but in accelerating development
(allowing the use of simpler algorithms, cutting the time to test performance
on realistic databases) and allowing many different algorithms and parameter
settings to be tried and compared for a particular task. This paper
summarizes four such applications. The applications described are: 1)
prediction of credit card "defaulters" (non-payers) and "attritters" (people
who didn't renew their cards) from a credit card database; 2) prediction of
the continuation of time series, e.g. stock price movements; 3) automatic
keyword assignment for news articles; and 4) protein secondary structure
prediction. These add to a list identified in an earlier paper [Waltz 90]
including: 5) automatic classification of U.S. Census Bureau long forms,
using MBR - Memory-Based Reasoning [Creecy et al 92, Waltz 89, Stanfill &
Waltz 86]; 6) generating catalogs for a mail order company that maximize
expected net returns (revenues from orders minus cost of the catalogs and
mailings) using genetically-inspired methods; and 7) text-based intelligent
systems for information retrieval, decision support, etc.
Keywords: database, AI, artificial intelligence,
pario-bib
Comment: Invited speaker.
Keywords: parallel I/O, virtual memory, paging,
characterization, pario-bib
Comment: They measured the paging behavior of
programs running on a Paragon, and analyze the results. To do so, they sample
the OSF paging statistics periodically. The general conclusions: they found a
surprising amount of dissimilarity in the paging behaviors of nodes within
the same program, both in terms of the amount of paging and the timing of
peak paging activity. These characteristics do not bode well for systems that
use gang scheduling, or applications that have a lot of barriers.
Abstract: Parallel scientific applications require
high-performance I/O support from underlying file systems. A comprehensive
understanding of the expected workload is therefore essential for the design
of high-performance parallel file systems. We re-examine the workload
characteristics in parallel computing environments in the light of recent
technology advances and new applications. We analyze application traces
from a cluster with hundreds of nodes. On average, each application has only
one or two typical request sizes. Large requests from several hundred
kilobytes to several megabytes are very common. Although in some
applications, small requests account for more than 90% of all requests,
almost all of the I/O data are transferred by large requests. All of these
applications show bursty access patterns. More than 65% of write requests
have inter-arrival times within one millisecond in most applications. By
running the same benchmark on different file models, we also find that the
write throughput of using an individual output file for each node exceeds
that of using a shared file for all nodes by a factor of 5. This indicates
that current file systems are not well optimized for file sharing.
Keywords: file system workload, workload
characterization, ASCI, lustre, scientific applications, pario-app, pario-bib
Comment: An I/O workload study of three
applications on a 960 node (dual-processors) cluster at LLNL running the
lustre-light parallel file system. The applications include an I/O
benchmarking code (ior2) and two physics simulations: one that ran on 343
processors and one that ran on 1620 processors.
Abstract: Datasets up to terabyte size and
petabyte total capacities have created a serious imbalance between I/O and
storage-system performance and system functionality. One promising approach
is the use of parallel data-transfer techniques for client access to storage,
peripheral-to-peripheral transfers, and remote file transfers. This paper
describes the parallel I/O architecture and mechanisms, parallel transport
protocol (PTP), parallel FTP, and parallel client application programming
interface (API) used by the high-performance storage system (HPSS). Parallel
storage integration issues with a local parallel file system are also
discussed.
Keywords: mass storage, parallel I/O,
multiprocessor file system interface, pario-bib
Abstract: This paper describes a new scheme for
remote file access called Smart File Objects (SFO). The SFO is an
object-oriented application-specific file access paradigm designed to attack
the bottleneck imposed by high latency, low bandwidth networks such as
wide-area and wireless networks. The SFO uses application and network
information to adaptively prefetch needed data in parallel with the execution
of the application. The SFO can offer additional advantages such as
non-blocking I/O, bulk I/O, improved file access APIs, and increased
reliability. We describe the SFO concept, a prototype implementation in the
Mentat system, and the results obtained with a distributed gene sequence
application running across the Internet and vBNS. The results show the
potential of the SFO approach to improve application performance.
Keywords: object, parallel I/O, pario-bib
Keywords: I/O, active storage, TPIE, grid,
parallel I/O, pario-bib
Comment: Very interesting talk... an extension of
the TPIE work. They assign a mapping of computations to storage-based
processors. This stuff is very similar to Armada. They place "functors" that
have bounded per-record processing and bounded internal state at the ASU
(active storage unit). Since functors have bounded computation and state,
they have predictive behavior (used for load balancing and scheduling). The
extensions to TPIE include data aggregation primitives for sets (unordered
data), streams (sequential data), and arrays (random access data). They also
allow functors to process "packets" (groups of records) useful for
applications like a merge sort. The example applications include the standard
TPIE GIS app, along with a merge sort.
Abstract: In this paper we present the concept and
first prototyping results of a modular fault-tolerant distributed mass
storage architecture for large Linux PC clusters as they are deployed by the
upcoming particle physics experiments. The device masquerading technique
using an Enhanced Network Block Device (ENBD) enables local RAID over remote
disks as the key concept of the ClusterRAID system. The block level interface
to remote files, partitions or disks provided by the ENBD makes it possible
to use the standard Linux software RAID to add fault-tolerance to the system.
Preliminary performance measurements indicate that the latency is comparable
to a local hard drive. With four disks throughput rates of up to 55MB/s were
achieved with first prototypes for a RAID0 setup, and about 40MB/s for a
RAID5 setup.
Keywords: RAID, fault-tolerance, high-energy
physics, parallel I/O, pario-app, pario-bib
Keywords: RAID, disk array, parallel I/O,
pario-bib
Keywords: RAID, disk array, parallel I/O,
pario-bib
Comment: Cite wilkes:autoraid. A commercial RAID
box that transparently manages a hierarchy of two RAID systems, a RAID-1
mirrored system and a RAID-5 system. The goal is easy-to-use high
performance, and they appear to have achieved that goal. Data in current use
are kept in the RAID-1, and other data in RAID-5. This design gives
performance of RAID-1 with cost of RAID-5. They have a clever scheme for
spreading both RAIDs across most disks, including a hot spare. Dual
controllers, power supplies, fans, etc. The design is a fairly standard RAID
hardware controller, using standard SCSI disks, but with all the new tricks
done in controller software. The paper gives a few results from the prototype
hardware, and a lot of simulation results.
Keywords: RAID, disk array, parallel I/O,
pario-bib
Comment: Part of jin:io-book; reformatted version
of wilkes:autoraid.
Keywords: parallel I/O, distributed file system,
disk caching, heterogeneous file system, pario-bib
Comment: Hooks a heterogeneous set of storage
devices together over a fast interconnect, each with its own identical
processor. The whole would then act as a file server for a network. Data
storage devices would range from fast to slow (e.g. optical jukebox), varying
availability, etc. Many ideas here but few concrete suggestions. Very little
mention of algorithms they might use to control the thing. See also
wilkes:datamesh1, cao:tickertaip, chao:datamesh, wilkes:houses,
wilkes:lessons.
Keywords: distributed file system, parallel I/O,
disk scheduling, disk layout, pario-bib
Comment: See chao:datamesh
Keywords: parallel I/O, RAID, disk striping,
pario-bib
Comment: An overview report on the DataMesh
project. It adds a little to the earlier reports such as wilkes:datamesh1. It
has some performance results from simulation.
Keywords: file system, distributed computing,
pario-bib
Comment: Same as wilkes:lessons. See that for
comments.
Keywords: file system, parallel I/O, RAID, disk
array, pario-bib
Comment: Invited speaker. Also appeared in ACM OSR
April 1993 (wilkes:houses). This gives his viewpoint that we should be
focusing more on architecture than on components, to design frameworks rather
than just individual policies and mechanisms. It also gives a quick overview
of DataMesh. For more DataMesh info, though, see cao:tickertaip,
chao:datamesh, wilkes:datamesh1, wilkes:datamesh, wilkes:houses.
Keywords: parallel I/O, pario-bib
Comment: Like the CUBIX interface, in some ways.
Meant for parallel access to non-striped (sequential) file. Self-describing
format so that the reader can read the formatting information and distribute
data accordingly.
Abstract: The ability to perform permutations of
large data sets in place reduces the amount of necessary available disk
storage. The simplest way to perform a permutation often is to read the
records of a data set from a source portion of data storage, permute them in
memory, and write them to a separate target portion of the same size. It can
be quite expensive, however, to provide disk storage that is twice the size
of very large data sets. Permuting in place reduces the expense by using only
a small amount of extra disk storage beyond the size of the data set. This
paper features in-place algorithms for commonly used structured permutations.
We have developed an asymptotically optimal algorithm for performing BMMC
(bit-matrix-multiply/complement) permutations in place that requires at most
$\frac{2N}{BD}\left(2\left\lceil\frac{{\rm rank}\,\gamma}{\lg (M/B)}\right\rceil
+ \frac{7}{2}\right)$ parallel disk
accesses, as long as $M \geq 2BD$, where $N$ is the number of records in the
data set, $M$ is the number of records that can fit in memory, $D$ is the
number of disks, $B$ is the number of records in a block, and $\gamma$ is the
lower left $\lg (N/B) \times \lg B$ submatrix of the characteristic matrix
for the permutation. This algorithm uses $N+M$ records of disk storage and
requires only a constant factor more parallel disk accesses and insignificant
additional computation than a previously published asymptotically optimal
algorithm that uses $2N$ records of disk storage. We also give
algorithms to perform mesh and torus permutations on a $d$-dimensional mesh.
The in-place algorithm for mesh permutations requires at most
$3\lceil N/BD\rceil$
parallel I/Os and the in-place algorithm for torus permutations uses at most
$4dN/BD$ parallel I/Os. The algorithms for mesh and torus permutations
require no extra disk space as long as the memory size $M$ is at least $3BD$.
The torus algorithm improves upon the previous best algorithm in terms of
both time and space.
Keywords: parallel I/O, parallel I/O algorithm,
permutation, out-of-core, pario-bib
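To give a feel for the BMMC bound, a worked example with illustrative
parameters (not from the paper): take $N = 2^{26}$ records, $B = 2^{10}$,
$D = 2^{3}$, $M = 2^{21}$, and suppose ${\rm rank}\,\gamma = 10$. Then
$\lg(M/B) = 11$, so $\lceil 10/11\rceil = 1$, and the bound becomes
$\frac{2N}{BD}\left(2 + \frac{7}{2}\right) = 2^{14}\cdot 5.5 = 90{,}112$
parallel disk accesses; the condition $M \geq 2BD = 2^{14}$ holds comfortably.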
Keywords: MPI I/O, parallel file system, pario-bib
Comment: They describe the port of MPI I/O to the
Sun Parallel File System (a direct descendant of Galley).
Keywords: MPI I/O, parallel I/O, multiprocessor
file system interface, pario-bib
Keywords: parallel I/O, hypercube, parallel file
system, pario-bib
Comment: Concrete system for the hypercube. Files
resident on one disk only. Little support for cooperation except for
sequentialized access to parts of the file, or broadcast. No mention of
random-access files. I/O nodes are distinguished from computation nodes. I/O
nodes have separate comm. network. No parallel access. I/O hooked to
front-end too.
Keywords: parallel I/O, video server, multimedia,
pario-bib
Keywords: I/O benchmark, transaction processing,
pario-bib
Comment: Not about parallel I/O, but see
olson:random. Defines a new I/O benchmark that is fairly system-independent.
Focus is for transaction processing systems. Cranks up many tasks (users) all
doing repetitive read/writes for a specified time, using optional locking,
and optional computation. Whole suite of results for comparison with others.
See also chen:iobench.
Keywords: parallel I/O, pario-bib
Comment: A brief introduction to the topic of
parallel I/O (what, why, current research), followed by a roundtable
discussion among the authors of the papers in womble:special-issue. The
discussion focused on three questions: 1) What are the biggest gaps in
current I/O services? 2) Why have vendors failed to adopt new file system
technologies? 3) How much direct low-level control over I/O resources should
be given to the users and why?
Abstract: Parallel computers are becoming more
powerful and more complex in response to the demand for computing power by
scientists and engineers. Inevitably, new and more complex I/O systems will
be developed for these systems. In particular we believe that the I/O system
must provide the programmer with the ability to explicitly manage storage
(despite the trend toward complex parallel file systems and caching schemes).
One method of doing so is to have a partitioned secondary storage in which
each processor owns a logical disk. Along with operating system enhancements
which allow overheads such as buffer copying to be avoided and libraries to
support optimal remapping of data, this sort of I/O system meets the needs of
high performance computing.
Keywords: parallel I/O, parallel file system,
pario-bib
Comment: They argue that it is important to allow
the programmer to explicitly control their storage in some way. In
particular, they advocate the Partitioned Secondary Storage (PSS) model, in
which each processor has its own logical disk, rather than using a parallel
file system (PFS) which automatically stripes a linear file across many
disks. Basically, programmer knows best. Of course, libraries can help. They
note that you will often need data in a different format than it comes, and
may need it output in a different format; so, permutation algorithms are
needed. Also important to be able to overlap computation with I/O. They use
LU factorization as an example, and give an algorithm. On the nCUBE with the
PUMA operating system, they get good performance. See womble:pario.
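The overlap of I/O with computation emphasized here is commonly obtained with
double buffering; the C sketch below uses POSIX asynchronous I/O to read the
next panel while the current one is processed. It is a generic illustration
with invented names and no error handling, not the PSS or nCUBE code from the
paper.

    #include <aio.h>
    #include <string.h>
    #include <unistd.h>

    /* Double-buffered out-of-core loop: while processing panel k (held in
       buf[cur]), asynchronously read panel k+1 into buf[1-cur].
       Checking aio_error/aio_return is omitted for brevity. */
    void process_panels(int fd, char *buf[2], size_t panel_bytes, int npanels,
                        void (*compute)(char *panel, size_t bytes))
    {
        struct aiocb cb;
        int cur = 0;

        pread(fd, buf[cur], panel_bytes, 0);              /* first panel */

        for (int k = 0; k < npanels; k++) {
            if (k + 1 < npanels) {                        /* start next read */
                memset(&cb, 0, sizeof cb);
                cb.aio_fildes = fd;
                cb.aio_buf    = buf[1 - cur];
                cb.aio_nbytes = panel_bytes;
                cb.aio_offset = (off_t)(k + 1) * panel_bytes;
                aio_read(&cb);
            }
            compute(buf[cur], panel_bytes);               /* overlap compute */
            if (k + 1 < npanels) {
                const struct aiocb *list[1] = { &cb };
                aio_suspend(list, 1, NULL);               /* wait for read */
                cur = 1 - cur;                            /* swap buffers */
            }
        }
    }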
Abstract: The solution of Grand Challenge Problems
will require computations which are too large to fit in the memories of even
the largest machines. Inevitably, new designs of I/O systems will be
necessary to support them. Through our implementations of an out-of-core LU
factorization we have learned several important lessons about what I/O
systems should be like. In particular we believe that the I/O system must
provide the programmer with the ability to explicitly manage storage. One
method of doing so is to have a partitioned secondary storage in which each
processor owns a logical disk. Along with operating system enhancements which
allow overheads such as buffer copying to be avoided, this sort of I/O system
meets the needs of high performance computing.
Keywords: parallel I/O, out-of-core, parallel
algorithm, scientific computing, multiprocessor file system, pario-bib
Comment: See womble:outofcore. See thakur:runtime,
kotz:lu, brunet:factor for other out-of-core LU results.
Keywords: parallel I/O, pario-bib
Comment: A one-page introduction to this special
issue of Parallel Computing, which includes many papers about parallel I/O.
See also womble:intro, nieuwejaar:jgalley, moore:ddio, barve:jmergesort,
miller:jrama, schwabe:jlayouts, parsons:templates, cormen:early-vic,
carretero:performance,
Abstract: We describe a benchmark problem, based
on the Block-Tridiagonal (BT) problem of the NAS Parallel Benchmarks (NPB),
which is used to test the output capabilities of high-performance computing
systems, especially parallel systems. We also present a source code
implementation of the benchmark, called NPBIO2.4-MPI, based on the MPI
implementation of the NPB, using a variety of ways to write the computed
solutions to file.
Keywords: parallel I/O benchmarks, block
tridiagonal, pario-app, pario-bib
Keywords: parallel I/O architecture, scientific
visualization, pario-bib
Comment: This paper is interesting for its
impressive usage of RAIDs and parallel networks to support scientific
visualization. In particular, the proposed Gigawall (a 10-foot by 6-foot
gigapixel-per-second display) is run by 24 SGI processors and 32 9-disk
RAIDs, connected to an MPP of some kind through an ATM switch. 512 GBytes of
storage, playable at 450 MBytes per second, for 19 minutes of animation.
Abstract: We present a fundamental improvement of
the generic techniques for non-contiguous file access in MPI-IO. The
improvement consists in the replacement of the conventional data management
algorithms based on a representation of the non-contiguous fileview as a list
of (offset, length) tuples. The improvement is termed listless i/o as it
instead makes use of space- and time-efficient datatype handling
functionality that is completely free of lists for processing non-contiguous
data in the file or in memory. Listless i/o has been implemented for both
independent and collective file accesses and improves access performance by
increasing the data throughput between user buffers and file buffers.
Additionally, it reduces the memory footprint of the process performing
non-contiguous I/O. In this paper we give results for a synthetic benchmark
on a PC cluster using different file systems. We demonstrate improvements in
I/O bandwidth that exceed a factor of 10.
Keywords: access patterns, MPI-IO, listless I/O,
pario-bib
Comment: Also see worringen:non-contiguous
Abstract: Many applications of parallel I/O
perform non-contiguous file accesses, but only few file system interfaces
support non-contiguous access. In contrast, the most commonly used parallel
programming interface, MPI, supports parallel I/O through its MPI-IO
interface. Within this interface, non-contiguous accesses are supported by
the use of derived MPI datatypes. Unfortunately, current MPI-IO
implementations suffer from low performance of such non-contiguous accesses
when compared to the performance of the storage system for contiguous
accesses although a considerable amount of work has been done in this area.
In this paper we analyze an important bottleneck in current implementations
of MPI-IO, and present a new technique termed listless i/o to perform
non-contiguous access with MPI-IO. On the NEC SX-series of parallel vector
computers, listless i/o is able to increase the bandwidth for non-contiguous
file access by sometimes more than a factor of 500 when compared to the
traditional approach.
Keywords: parallel I/O interface, file access
patterns, pario-bib
Comment: published on the web
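Noncontiguous MPI-IO access of the kind these papers optimize is normally
expressed through a derived-datatype file view; the hedged C sketch below
uses the standard subarray type and a collective write, with invented names,
and says nothing about the listless i/o internals.

    #include <mpi.h>

    /* Each rank writes its block of a global 2-D array through a
       noncontiguous file view built from a subarray datatype. */
    void write_block(const char *path, double *local,
                     int gsizes[2], int lsizes[2], int starts[2])
    {
        MPI_Datatype filetype;
        MPI_File fh;

        MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native",
                          MPI_INFO_NULL);

        /* Collective write: the noncontiguous file regions are described by
           the view, so the library can merge them across ranks. */
        MPI_File_write_all(fh, local, lsizes[0] * lsizes[1], MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
    }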
Abstract: This paper presents a new approach
towards parallel I/O for message-passing (MPI) applications on clusters built
with commodity hardware and an SCI interconnect: instead of using the classic
scheme of clients and a number of servers communicating via TCP/IP, a pure
peer-to-peer communication topology based on efficient use of the underlying
SCI interconnect is presented. Every process of the MPI application acts as
both client and server for I/O operations. This allows for a maximum of locality
in file access, while the accesses to remote portions of the distributed file
are performed via distributed shared memory techniques. A server is only
required to manage the initial distribution of the file fragments between the
participating nodes and to provide services like external access and
redundancy.
Keywords: parallel I/O, MPI-IO, SCI connected
clusters, pario-bib
Comment: Short paper and a poster. Poster
URL=
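Comment: A minimal sketch (hypothetical code, not from the paper) of the
peer-to-peer layout described above: the file is split into fixed-size
fragments distributed round-robin over the MPI ranks, so every rank serves
its own fragments and reaches remote ones through what would be SCI
remote-memory access:

FRAG = 4  # toy fragment size in bytes

def distribute(filedata, nprocs):
    """Initial distribution of file fragments across ranks (the only server task)."""
    stores = [dict() for _ in range(nprocs)]
    for i in range(0, len(filedata), FRAG):
        stores[(i // FRAG) % nprocs][i] = filedata[i:i + FRAG]
    return stores

def read(stores, rank, offset, length):
    """Rank `rank` reads a byte range; when the owner equals rank the access is
    local, otherwise it stands in for an SCI-style remote-memory access."""
    nprocs, out, pos = len(stores), bytearray(), offset
    while pos < offset + length:
        base = (pos // FRAG) * FRAG
        owner = (base // FRAG) % nprocs
        frag = stores[owner][base]     # local if owner == rank, else "remote"
        take = min(base + FRAG, offset + length) - pos
        out += frag[pos - base:pos - base + take]
        pos += take
    return bytes(out)

if __name__ == "__main__":
    data = bytes(range(32))
    stores = distribute(data, nprocs=4)
    assert read(stores, rank=1, offset=5, length=20) == data[5:25]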
Abstract: Noncontiguous I/O access is the main
access pattern in many scientific applications. Noncontiguity exists both in
access to files and in access to target memory regions on the client. This
characteristic imposes a requirement of native noncontiguous I/O access
support in cluster file systems for high performance. In this paper we
address noncontiguous data transmission between the client and the I/O server
in cluster file systems over a high performance network. We propose a novel
approach, RDMA Gather/Scatter, to transfer noncontiguous data for such I/O
accesses. We also propose a new scheme, Optimistic Group Registration, to
reduce memory registration costs associated with this approach. We have
designed and incorporated this approach in a version of PVFS over InfiniBand.
Through a range of PVFS and MPI-IO micro-benchmarks, and the NAS BTIO
benchmark, we demonstrate that our approach attains significant performance
gains compared to other existing approaches.
Keywords: noncontiguous access patterns, PVFS,
Infiniband, RDMA, pario-bib
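Comment: The sketch below is my own toy cost model, not the authors' code: it
builds an RDMA-style gather list from noncontiguous regions and approximates
"Optimistic Group Registration" as a single registration call covering the
whole group rather than one call per region; the cost constants are arbitrary
assumptions:

REG_COST = 100        # assumed fixed cost per memory-registration call
REG_BYTE_COST = 0.01  # assumed per-byte registration cost

def per_region_cost(regions):
    """Naive scheme: one registration call per noncontiguous region."""
    return sum(REG_COST + REG_BYTE_COST * length for _, length in regions)

def group_cost(regions):
    """Optimistic group registration: register one span covering the group."""
    lo = min(addr for addr, _ in regions)
    hi = max(addr + length for addr, length in regions)
    return REG_COST + REG_BYTE_COST * (hi - lo)

def gather_list(regions):
    """Gather list handed to the NIC: (address, length) entries in address order."""
    return sorted(regions)

if __name__ == "__main__":
    regions = [(4096 + 256 * i, 64) for i in range(32)]  # 32 small strided pieces
    print("per-region registration cost:", per_region_cost(regions))
    print("group registration cost:     ", group_cost(regions))
    print("gather entries:", len(gather_list(regions)))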
Keywords: disk prefetching, parallel I/O, disk
caching, sorting, pario-bib
Comment: They discuss prefetching and caching in
database machines where mergesorts merge several input streams, each from its
own disk, to one output stream, to its own disk. There are concurrent merges
going on. A merge can cause thrashing when writes grab a clean buffer that
holds an unused prefetch, thus forcing the block to later be read again. They
consider several policies to handle this, but it seemed to me that they
missed an obvious alternative that may have been better: whenever you need a
clean buffer to write into, but all the clean buffers hold unused-prefetched
blocks, stall and wait while the dirty blocks are flushed (presumably started
earlier when the clean-block count got too low). It seems better to wait for
a half-finished write than to toss out a prefetched block and later have to
read it again. Their simulations show that their techniques help a lot.
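Comment: A short sketch of the alternative suggested in the comment above
(names and structure are mine): the stalling variant never steals an unused
prefetched buffer; it waits for a dirty-block flush to complete instead, so
no block has to be read twice:

def get_buffer_evicting(free, prefetched, flushing):
    """Evicting policy: steal an unused prefetched buffer if no free buffer
    exists; the evicted block must be read again later (cost 1 re-read)."""
    if free:
        return free.pop(), 0
    if prefetched:
        prefetched.pop()
        return "stolen", 1
    flushing.pop()
    return "flushed", 0

def get_buffer_stalling(free, prefetched, flushing):
    """Commenter's alternative: leave prefetched buffers alone; stall until a
    dirty buffer finishes flushing (half a write wait, no re-read)."""
    if free:
        return free.pop(), 0
    flushing.pop()       # block here until an in-progress flush completes
    return "flushed", 0

if __name__ == "__main__":
    buf, rereads = get_buffer_stalling(free=[], prefetched=["blk7"], flushing=["blk3"])
    assert rereads == 0 and buf == "flushed"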
Abstract: A comprehensive study of the entire
petabyte-scale archival data of astronomical observatories holds the promise
of new science and new knowledge in the field, but it has not been feasible
so far due to the lack of an adequate data analysis environment. The Grid
Datafarm architecture is designed for global petabyte-scale data-intensive
computing; it provides a Grid file system with file replica management for
fault tolerance and load balancing, and parallel and distributed computing
support for sets of files, to meet the requirements of such a comprehensive
study of the whole archive. In this paper, we discuss worldwide parallel and
distributed data analysis in observational astronomy. The archival data is
stored, replicated, and dispersed in a Gfarm file system. All the
astronomical data analysis tools successfully access files in the Gfarm file
system without any code modification, using a syscall-hooking library,
regardless of file replica locations. Performance evaluation of the parallel
data analysis in several configurations shows that file-affinity process
scheduling plays an essential role in scalable and efficient parallel file
I/O performance. A data calibration tool shows scalable file I/O performance,
achieving 5.9 GB/sec and 4.0 GB/sec for reading and writing FITS files,
respectively, using 30 cluster nodes (60 CPUs). On-demand file replica
creation mitigates the overhead of access concentration. Another tool shows a
performance improvement of a factor of six for reading a shared file by
creating file replicas.
Keywords: grid, grid datafarm, astronomical data,
pario-app, pario-bib
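Comment: A toy sketch (mine, not Gfarm code) of file-affinity process
scheduling: each analysis task is placed on the least-loaded node that
already holds a replica of its input file, so file I/O stays node-local:

def schedule(tasks, replicas, nodes):
    """tasks: {task: filename}; replicas: {filename: set of nodes holding a
    replica}; returns {task: node} placing each task near its data."""
    load = {n: 0 for n in nodes}
    placement = {}
    for task, fname in tasks.items():
        candidates = replicas.get(fname, set(nodes))   # fall back to any node
        node = min(candidates, key=lambda n: load[n])  # least-loaded replica holder
        placement[task] = node
        load[node] += 1
    return placement

if __name__ == "__main__":
    replicas = {"frame1.fits": {"n0", "n1"}, "frame2.fits": {"n1"}, "frame3.fits": {"n2"}}
    tasks = {"calib1": "frame1.fits", "calib2": "frame2.fits", "calib3": "frame3.fits"}
    print(schedule(tasks, replicas, nodes=["n0", "n1", "n2"]))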
Abstract: Grid is the largest advance of network
after Internet since the Grid System provides a specialty that can be used
popularly and effectively. However, it is a challenge to the consistency and
community of use on the data storages space of a Grid System. Therefore, the
problem of application for the Computational Grid and Data Grid is more
important. It can set up a usability, expandability, high operation
capability, and large memory space in Grid with the Cluster system and
parallel technique in order to solve the problem. In this paper, we provided
a Grid with high operation capability and higher memories to solve the
problem. As to the Grid setting, we take use of the Cluster computing to
increase the operation effect for computing, and a PVFS2 with more storages
effect for data. It can supply a quite correct platform for Grid user whether
for large data access or huge operation.
Keywords: grid I/O, PVFS2, cluster file system,
pario-bib
Keywords: parallel I/O, pario-bib
Comment: They propose to link a set of disks with
its own interconnect, e.g., a torus, to allow the disks to communicate to
compute multi-dimensional parity and to respond to disk failures, without
using the primary interconnect of the multiprocessor or distributed system.
In this sense it is reminiscent of TickerTAIP or DataMesh.
Abstract: Scalable disk systems are required to
implement well-balanced computer systems. We have proposed DR-nets,
Data-Reconstruction networks, to construct scalable parallel disk systems
with high reliability. Each node of a DR-net has disks and is connected by
links to form an interconnection network. To realize high reliability, the
nodes in a sub-network of the interconnection network organize a
parity-calculation group of the kind proposed for RAIDs. Inter-node
communication for calculating parity preserves the locality of data transfer
and prevents bottlenecks, even when the network becomes very
large. We have developed an experimental system using Transputers. In this
chapter, we provide execution models for estimating the response time and
throughput of DR-nets, and compare them to experimental results. We also
discuss the reliability of the DR-nets and RAIDs.
Keywords: parallel I/O architecture, disk array,
pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
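Comment: A small sketch (my own) of the RAID-style parity idea within one
DR-net subgroup: parity is the XOR of the group's data blocks, and a lost
block is rebuilt from the survivors plus parity using only intra-group
communication:

def xor_blocks(blocks):
    """XOR a list of equal-length blocks byte by byte."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

if __name__ == "__main__":
    group = [bytes([i] * 8) for i in (3, 5, 9)]  # data blocks held by group nodes
    parity = xor_blocks(group)                   # stored on the group's parity node
    rebuilt = xor_blocks([group[0], group[2], parity])  # node with group[1] failed
    assert rebuilt == group[1]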
Keywords: parallel I/O, disk array, RAID, mobile
computing, pario-bib
Comment: low-power, highly available disk arrays
for mobile computers.
Abstract: A performance model for a parallel I/O
system is essential for detailed performance analyses, automatic performance
optimization of I/O request handling, and potential performance bottleneck
identification. Yet how to build a portable performance model for a parallel
I/O system is an open problem. In this paper, we present a machine-learning
approach to automatic performance modeling for parallel I/O systems. Our
approach is based on the use of a platform-independent performance
metamodel, which is a radial basis function neural network. Given training
data, the metamodel generates a performance model automatically and
efficiently for a parallel I/O system on a given platform. Experiments
suggest that our goal of having the generated model provide accurate
performance predictions is attainable, for the parallel I/O library that
served as our experimental testbed on an IBM SP. This suggests that it is
possible to model parallel I/O system performance automatically and portably,
and perhaps to model a broader class of storage systems as well.
Keywords: parallel I/O, performance model,
pario-bib
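Comment: The sketch below shows generic radial-basis-function regression of
the kind the abstract describes; the features and training values are
invented toy numbers, not the paper's data:

import numpy as np

def rbf_fit(X, y, width=1.0):
    """Fit an RBF network with one Gaussian basis function per training point."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * width ** 2))
    return np.linalg.solve(Phi + 1e-8 * np.eye(len(X)), y)

def rbf_predict(X_train, weights, X_new, width=1.0):
    """Predict by evaluating the fitted basis functions at new points."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2)) @ weights

if __name__ == "__main__":
    # toy training data: (log2 request size, number of I/O nodes) -> seconds
    X = np.array([[10, 2], [12, 2], [14, 4], [16, 4], [18, 8]], dtype=float)
    y = np.array([0.02, 0.05, 0.08, 0.20, 0.35])
    w = rbf_fit(X, y, width=2.0)
    print(rbf_predict(X, w, np.array([[15, 4]], dtype=float), width=2.0))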
Abstract: A variety of performance-enhancing
techniques, such as striping, mirroring, and rotational data replication,
exist in the disk array literature. Given a fixed budget of disks, one must
intelligently choose what combination of these techniques to employ. In this
paper, we present a way of designing disk arrays that can flexibly and
systematically reduce seek and rotational delay in a balanced manner. We give
analytical models that can guide an array designer towards optimal
configurations by considering both disk and workload characteristics. We have
implemented a prototype disk array that incorporates the configuration
models. In the process, we have also developed a robust disk head position
prediction mechanism without any hardware support. The resulting prototype
demonstrates the effectiveness of the configuration models.
Keywords: disk array, file system, parallel I/O,
pario-bib
Keywords: parallel I/O, disk array, database, disk
reorganization, pario-bib
Keywords: unix, parallel operating system,
multiprocessor file system, pario-bib
Comment: Describing the changes to OSF/1 to make
OSF/1 AD TNC, primarily intended for NORMA MIMD multicomputers. Enhancements
include a new file system, distributed implementation of sockets, and process
management. The file system still has traditional file systems, each in its
own partition, with a global name space built by mounting file systems on
each other. The change is that mounts can be remote, ie, managed by a
different file server on another node. They plan to use prefix tables for
pathname translation (welch:prefix,nelson:sprite). They use a token-based
protocol to provide atomicity of read and write calls, and to maintain
consistency of client-node caches. See also roy:unixfile. Process
enhancements include a new SIGMIGRATE, rfork(), and rforkmulti().
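Comment: A loose sketch (my own, not the OSF/1 AD TNC code) of a token
protocol of the kind mentioned above: a per-file manager grants shared read
tokens or an exclusive write token and revokes conflicting tokens, which
invalidates client caches, before each new grant:

class TokenManager:
    def __init__(self):
        self.readers = set()   # nodes holding a read token
        self.writer = None     # node holding the write token, if any

    def acquire_read(self, node, revoke):
        """Grant a shared read token, forcing write-back of any other writer."""
        if self.writer is not None and self.writer != node:
            revoke(self.writer)            # write-back and invalidate
            self.writer = None
        self.readers.add(node)

    def acquire_write(self, node, revoke):
        """Grant the exclusive write token, invalidating other cached copies."""
        for r in list(self.readers - {node}):
            revoke(r)
            self.readers.discard(r)
        if self.writer not in (None, node):
            revoke(self.writer)
        self.writer = node

if __name__ == "__main__":
    tm = TokenManager()
    tm.acquire_read("nodeA", revoke=lambda n: print("revoke", n))
    tm.acquire_write("nodeB", revoke=lambda n: print("revoke", n))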
Abstract: This paper introduces a new benchmark
tool for evaluating performance and availability (performability) of
networked storage systems, specifically storage area networks (SANs), which
are intended to provide block-level data storage with high performance and
availability. The new benchmark tool, named N-SPEK (Networked-Storage
Performability Evaluation Kernel module), consists of a controller, several
workers, one or more probers, and several fault injection modules. N-SPEK is
highly accurate and efficient since it runs at kernel level and eliminates
skews and overheads caused by file systems. It allows a SAN architect to
generate configurable storage workloads to the SAN under test and to inject
different faults into various SAN components such as network devices, storage
devices, and controllers. Available performance under different workloads
and failure conditions is dynamically collected and recorded by N-SPEK over
time. To demonstrate its functionality, we apply N-SPEK to evaluate the
performability of a specific iSCSI-based SAN in a Linux
environment. Our experiments show that N-SPEK not only efficiently generates
quantitative performability results but also reveals a few optimization
opportunities for future iSCSI implementations.
Keywords: benchmarking, performance, block-level
access, pario-bib
Abstract: As clusters grow in size, their
processing capability increases rapidly. Users exploit this increased power
to run scientific, physical, and multimedia applications. These kinds of
data-intensive applications require a high-performance storage subsystem.
Parallel storage systems such as RAID are widely used in today's clusters. In
this paper, we propose a "greedy" I/O scheduling method that utilizes scatter
and gather operations inside the PCI-SCSI adapter to combine as many I/O
operations targeting the same disk as possible. In this way we reduce the
number of I/O operations and improve the performance of the whole storage
system. After analyzing the RAID control strategy, we find that combining I/O
commands may also cause data movement in memory, and this movement increases
system overhead. Experimental results in our real operating environment show
that better performance can be achieved: the longer the data length, the
larger the improvement, in some cases exceeding 40%.
Keywords: parallel I/O, disk scheduling, pario-bib
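Comment: A rough sketch (mine, not the paper's) of the greedy combining step:
queued requests are grouped per disk, sorted by block address, and contiguous
ones are merged into a single scatter/gather operation:

def combine(requests):
    """requests: list of (disk, start_block, n_blocks); returns merged list."""
    by_disk = {}
    for disk, start, n in requests:
        by_disk.setdefault(disk, []).append((start, n))
    merged = []
    for disk, reqs in by_disk.items():
        reqs.sort()
        cur_start, cur_n = reqs[0]
        for start, n in reqs[1:]:
            if start <= cur_start + cur_n:        # contiguous or overlapping: merge
                cur_n = max(cur_n, start + n - cur_start)
            else:
                merged.append((disk, cur_start, cur_n))
                cur_start, cur_n = start, n
        merged.append((disk, cur_start, cur_n))
    return merged

if __name__ == "__main__":
    q = [(0, 100, 8), (1, 40, 8), (0, 108, 8), (0, 200, 16), (1, 48, 4)]
    print(combine(q))   # -> [(0, 100, 16), (0, 200, 16), (1, 40, 12)]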
Abstract: Out-of-core applications perform poorly
in paged virtual memory (VM) systems because demand paging involves slow disk
I/O accesses. Much research has been done on reducing the I/O overhead in
such applications by either reducing the number of I/Os or lowering the cost
of each I/O operation. In this paper, we investigate a method that combines
fine-grained threading with a memory server model to improve the performance
of out-of-core applications on multicomputers. The memory server model
decreases the average cost of I/O operations by paging to remote memory,
while the fine-grained thread scheduling reduces the number of I/O accesses
by improving the data locality of applications. We have evaluated this method
on an Intel Paragon with 7 applications. Our results show that the memory
server system performs better than the VM disk paging by a factor of 5 for
sequential applications and a factor of 1.5 to 2.2 for parallel applications.
The fine-grained threading alone improves the VM disk paging performance by a
factor of 10 and 1.2 to 3 respectively for sequential and parallel
applications. Overall, the combination of these two techniques outperforms
the VM disk paging by more than a factor of 12 for sequential applications
and a factor of 3 to 6 for parallel applications.
Keywords: threads, scheduling, memory, out-of-core
application, parallel I/O, pario-bib
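Comment: A toy sketch of the memory-server idea with hypothetical costs: on a
page fault the page is fetched from a remote node's memory over the
interconnect when a memory server holds it, and from disk otherwise:

NET_COST, DISK_COST = 1, 20   # assumed relative page-fetch costs

class Pager:
    def __init__(self, remote_pages):
        self.remote = remote_pages   # pages currently held by memory servers
        self.cost = 0

    def fault(self, page):
        """Handle a page fault: prefer remote memory, fall back to disk."""
        if page in self.remote:
            self.cost += NET_COST    # remote-memory page-in
        else:
            self.cost += DISK_COST   # disk paging
        return page

if __name__ == "__main__":
    p = Pager(remote_pages={1, 2, 3, 5, 8})
    for pg in [1, 2, 9, 3, 5, 8, 1]:
        p.fault(pg)
    print("total fault cost:", p.cost)   # 6 remote hits + 1 disk miss = 26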
Abstract: In this paper we analyze the I/O access
patterns of a widely-used biological sequence search tool and implement two
variations that employ parallel-I/O for data access based on PVFS (Parallel
Virtual File System) and CEFT-PVFS (Cost-Effective Fault-Tolerant PVFS).
Experiments show that the two variations outperform the original tool when
equal or even fewer storage devices are used in the former. It is also found
that although the performance of the two variations improves consistently
when initially increasing the number of servers, this performance gain from
parallel I/O becomes insignificant with further increase in server number. We
examine the effectiveness of two read performance optimization techniques in
CEFT-PVFS by using this tool as a benchmark. Performance results indicate:
(1) Doubling the degree of parallelism boosts the read performance to
approach that of PVFS; (2) Skipping hotspots can substantially improve the
I/O performance when the load on data servers is highly imbalanced. The I/O
resource contention due to the sharing of server nodes by multiple
applications in a cluster has been shown to degrade the performance of the
original tool and the variation based on PVFS by up to 10-fold and 21-fold,
respectively, whereas the variation based on CEFT-PVFS suffered only a
two-fold performance degradation.
Keywords: BLAST, CEFT-PVFS, parallel I/O, PVFS,
application, characterization, I/O access patterns, biology application,
pario-app, pario-bib
Additionally, when the data servers, which typically are also computational
nodes in a cluster environment, are loaded in an unbalanced way by
applications running in the cluster, the read performance of PVFS will be
degraded significantly. In contrast, in CEFT-PVFS, a heavily loaded data
server can be skipped and all the desired data read from its mirror node.
Thus the performance is not affected unless both the server node and its
mirror node are heavily loaded.
Abstract: Due to the ever-widening performance gap
between processors and disks, I/O operations tend to become the major
performance bottleneck of data-intensive applications on modern clusters. If
all the existing disks on the nodes of a cluster are connected together to
establish high performance parallel storage systems, the cluster's overall
performance can be boosted at no additional cost. CEFT-PVFS (a RAID 10 style
parallel file system that extends the original PVFS), as one such system,
divides the cluster nodes into two groups, stripes the data across one group
in a round-robin fashion, and then duplicates the same data to the other
group to provide storage service of high performance and high reliability.
Previous research has shown that the system reliability is improved by a
factor of more than 40 with mirroring while maintaining a comparable write
performance. This paper presents another benefit of CEFT-PVFS in which the
aggregate peak read performance can be improved by as much as 100% over that
of the original PVFS by exploiting the increased parallelism.
Keywords: parallel I/O, fault-tolerance, read
performance, parallel file system, PVFS, pario-bib
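Comment: A simple sketch (mine, not CEFT-PVFS code) of the hotspot-skipping
read optimization: a read of a stripe unit goes to whichever of the primary
server or its mirror is currently less loaded, so a hot node can be bypassed
and the two groups together roughly double the available read parallelism:

def pick_server(stripe, primaries, mirrors, load):
    """Choose the primary or mirror node holding this stripe, whichever is
    less loaded at the moment of the read."""
    p = primaries[stripe % len(primaries)]
    m = mirrors[stripe % len(mirrors)]
    return p if load[p] <= load[m] else m

if __name__ == "__main__":
    primaries = ["p0", "p1", "p2", "p3"]
    mirrors = ["m0", "m1", "m2", "m3"]
    load = {"p0": 5, "p1": 0, "p2": 9, "p3": 1, "m0": 0, "m1": 2, "m2": 1, "m3": 4}
    plan = [pick_server(s, primaries, mirrors, load) for s in range(8)]
    print(plan)   # hot servers p0 and p2 are skipped in favor of m0 and m2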