ACM Computing Surveys
28A(4), December 1996,
http://www.acm.org/surveys/1996/ChoudharyFile/. Copyright
© 1996 by the Association for Computing Machinery, Inc. See
the permissions statement below.
Working Group on Storage I/O for Large-Scale Computing
Large-Scale File Systems with the Flexibility of Databases
Alok Choudhary
Northwestern University,
Department of Electrical and Computer Engineering
Technological Institute,
2145 Sheridan Road,
Evanston, IL, 60208-3118, USA
choudhar@ece.nwu.edu,
http://web.eecs.nwu.edu/~choudhar/
David Kotz
Dartmouth College,
Department of Computer Science
6211 Sudikoff Laboratory, Hanover, NH, 03755-3510, USA
Abstract:
We note that large-scale computing includes many applications with
intensive I/O demands. A data-storage system for such applications
must address two issues: locating the appropriate data set, and
accessing the contents of the data set. Today, there are two extreme
models of data location and management: 1) file systems, which can be
fast but which require a user to manage the structure of the file-name
space and, often, of the file contents; and 2)
object-oriented-database systems, in which even the smallest granule
of data is stored as an object with associated access methods, which
is very flexible but often slow. We propose a solution that may
provide the performance of file systems with the flexibility of object
databases.
Categories and Subject Descriptors:
B.4.3 [Input-Output and Data Communications]: Interconnections - Interfaces;
C.2.4 [Computer Communication Networks]: Distributed Systems - Network Operating Systems;
D.4.2 [Operating Systems]: Storage Management - Storage Hierarchies;
D.4.3 [Operating Systems]: File Systems Management - Distributed File Systems;
D.3.3 [Programming Languages]: Language Constructs and Features - Data Type and Structures;
E.2 [Data Storage Representation]: Composite Structures;
E.5 [Files]: Optimizations;
H.2 [Database Management]: Systems.
General Terms: Parallel I/O, File System, Database
Additional Key Words and Phrases:
High-Performance I/O, Parallel File System, Runtime Systems, Database
Management, Storage Systems, Compilation
1 Applications
Large-scale computing includes many application areas with intensive
I/O demands, including scientific computing, data mining, digital
libraries, decision support, and data visualization. The data comes
in many forms (matrices, finite-element meshes, transaction logs,
customer databases, financial records, literature, music, images,
video), much of which has a complex structure. There is structure
both in the format of each data object (a matrix, for example, or a
video clip) and in the collection of objects. All of
the large-scale computing applications mentioned above will have to
manage tremendous collections of large objects.
2 Architectural assumptions
We assume that large-scale computing environments will include a range
of systems from personal computers and workstations to
high-performance computers, often geographically distributed, and
interconnected with high-speed networks. Consumers of the data and
information may not know how or where the data is stored or
managed. A distributed set of mass-storage servers will
collectively provide the primary storage for large data collections.
The supercomputers and workstations will have internal storage
systems, probably but not necessarily based on disks attached to all
nodes of parallel computers (depending on the prospects for
network-attached peripherals). Supercomputers, at least, will have
multiple network connections, to avoid a bottleneck at a single
interface node.
Memory capacity, at all levels from RAM to tape robots, is
increasing exponentially, a trend that is expected to continue for at
least another decade. Latency and bandwidth, however, are improving
very slowly, thus widening the gap between computation and I/O speeds.
We are therefore faced with the challenge of managing ever-larger and
ever-slower storage (when compared to high-performance CPUs), for both
high performance and ease of use. There are as yet no clear
solutions.
3 Database-style metadata management
A data-storage system must address two issues: locating the
appropriate data set, and accessing the contents of the data set.
Today, there are two extreme models of data location and management:
1) file systems, which can be fast but which require a user to
manage the structure of the file-name space and, often, of the file
contents; and 2) object-oriented-database systems, in which even
the smallest granule of data is stored as an object with associated
access methods, which is very flexible but often slow.
We propose a solution that may provide the performance of file
systems with the flexibility of object databases.
Locating a data set
We believe that the traditional data-storage abstraction of flat
files, using short names in a hierarchical name space, is doomed to
failure in future large-scale storage systems. Consider a scientific
computing application such as global climate modeling. With a
traditional file system, the data from each pass of each satellite may
be stored in a single file. There are tens of satellites, several
passes per day, hundreds of days per year, and tens of years, leading
to at least tens of thousands of files. There are then many derivative files for each
raw-data file. There are also several output files from each run of
the climate models.
A database may be more appropriate than a traditional file system
to manage such a data collection. The programmer can then use
database queries to select the appropriate data set. In essence, we
propose to replace the file-name space with a database structure;
unlike a traditional database, however, the "records" might be
matrices, images, movies, and other complex data objects. The
scientific-database community has been proposing this idea for some
time; can it be generalized?
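
As a minimal sketch of what replacing the file-name space with a database might look like, the example below builds a small metadata catalog and selects data sets by query rather than by path name. The schema, the satellite names, and the `obj:` storage handles are all hypothetical, chosen only to mirror the climate-modeling example; the records themselves (matrices, images, and so on) would live elsewhere, and the catalog only locates them.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog with one row per stored data object.

    The schema is an illustrative assumption, not a proposed standard.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE datasets ("
        " id INTEGER PRIMARY KEY,"
        " satellite TEXT,"
        " pass_no INTEGER,"
        " obs_date TEXT,"   # ISO 8601 date, so string order is date order
        " kind TEXT,"       # 'raw' or 'derived'
        " location TEXT)"   # opaque handle to the stored object
    )
    rows = [
        ("SAT-A", 1, "1996-03-01", "raw", "obj:0001"),
        ("SAT-A", 2, "1996-03-01", "raw", "obj:0002"),
        ("SAT-B", 1, "1996-03-01", "raw", "obj:0003"),
        ("SAT-A", 1, "1996-03-02", "derived", "obj:0004"),
    ]
    db.executemany(
        "INSERT INTO datasets (satellite, pass_no, obs_date, kind, location) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    return db


def locate(db, satellite, first_day, last_day, kind="raw"):
    """Return the storage handles of all data sets matching the query."""
    cur = db.execute(
        "SELECT location FROM datasets "
        "WHERE satellite = ? AND kind = ? AND obs_date BETWEEN ? AND ? "
        "ORDER BY obs_date, pass_no",
        (satellite, kind, first_day, last_day),
    )
    return [loc for (loc,) in cur]
```

A call such as `locate(build_catalog(), "SAT-A", "1996-03-01", "1996-03-01")` selects both raw passes of one satellite on one day without the programmer having to know how the objects are named or laid out.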
Accessing the data set
Given the rich structure and diversity of the stored data objects,
programmers need a richer data-access interface than the traditional
flat file. Indeed, we expect to see three levels of interface:
- For system programmers: a low-level interface for writing
datatype-specific and application-specific access methods
(see below). It provides low-level access to information
such as locality and layout.
- For application programmers: a strided read/write/seek
interface, plus any datatype-specific or application-specific
access methods.
- For end-users: generally application-specific interfaces.
A new "shell" for metadata access lets them
select a data set and pass it on to an application for processing.
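
To make the application-programmer level concrete, here is one possible shape for a strided read, sketched over an ordinary seekable byte stream. The function name `strided_read` and its parameters are assumptions made for illustration; a real parallel file system would presumably service the whole request in one operation rather than loop over seeks as this sketch does.

```python
import io


def strided_read(f, offset, record_size, stride, count):
    """Read up to `count` records of `record_size` bytes from seekable `f`.

    Record i starts at byte `offset + i * stride`. Reading stops early if a
    record would run past the end of the file.
    """
    if stride < record_size:
        raise ValueError("stride must be at least record_size")
    chunks = []
    for i in range(count):
        f.seek(offset + i * stride)
        chunk = f.read(record_size)
        if len(chunk) < record_size:
            break  # ran past the end of the file
        chunks.append(chunk)
    return b"".join(chunks)


# Example: a 4x4 row-major matrix of one-byte elements; fetch column 1.
matrix = io.BytesIO(bytes(range(16)))
column = strided_read(matrix, offset=1, record_size=1, stride=4, count=4)
```

Expressing the column fetch as a single strided request, instead of four separate reads, gives the storage system the whole access pattern at once, which is what makes lower-level optimization possible.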
Each data type or application may have customized access methods
that provide higher-level abstractions to the application programmer,
and that can take advantage of low-level information to optimize
performance. Type-specific access methods might provide access to
irregular structures, provide hidden compression or format conversion,
and so forth.
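
The sketch below illustrates one way a type-specific access method could hide compression from the application programmer. The class names, the `ACCESS_METHODS` registry, and the `TypedStore` container are hypothetical; only `zlib` is a real library. The point is that the caller reads and writes plain bytes while the access method chosen by the object's type decides the stored representation.

```python
import zlib


class RawAccess:
    """Default access method: bytes are stored exactly as given."""

    def encode(self, data):
        return data

    def decode(self, stored):
        return stored


class CompressedAccess:
    """Access method that compresses on write and expands on read."""

    def encode(self, data):
        return zlib.compress(data)

    def decode(self, stored):
        return zlib.decompress(stored)


# Hypothetical registry mapping a type tag to its access method.
ACCESS_METHODS = {"raw": RawAccess(), "compressed": CompressedAccess()}


class TypedStore:
    """Toy object store in which every object carries a type tag."""

    def __init__(self):
        self._objects = {}  # name -> (type tag, stored bytes)

    def write(self, name, type_tag, data):
        method = ACCESS_METHODS[type_tag]
        self._objects[name] = (type_tag, method.encode(data))

    def read(self, name):
        type_tag, stored = self._objects[name]
        return ACCESS_METHODS[type_tag].decode(stored)

    def stored_size(self, name):
        """Size of the stored representation, for inspection only."""
        return len(self._objects[name][1])
```

Format conversion or access to irregular structures could be added the same way, by registering another access method under a new type tag.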
The code for some access methods may run on the I/O servers, where
it can interact with the I/O services and devices in a tightly coupled
fashion. This idea puts application- or type-specific code on both
ends of the network between the application and the I/O device,
leading to better performance.
We have traditionally had an abstraction for disk devices (a
linear sequence of sectors) that aids development of layout- and
access-optimization algorithms. Given an appropriate abstraction for
locality of disks, relative to each other and to the processors, can
access methods optimize inter-disk layout (declustering) and access
patterns?
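
For readers unfamiliar with declustering, the following sketch shows its simplest form, round-robin placement of logical blocks across disks. It deliberately ignores the locality of disks relative to each other and to the processors, which is exactly the information the question above asks an abstraction to expose. The function names are illustrative.

```python
def decluster(block, num_disks):
    """Map a logical block number to (disk index, block index on that disk)
    using plain round-robin placement."""
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    return block % num_disks, block // num_disks


def disks_touched(first_block, num_blocks, num_disks):
    """Return the set of disks involved in reading a contiguous block range."""
    return {
        decluster(b, num_disks)[0]
        for b in range(first_block, first_block + num_blocks)
    }
```

Under this placement a sequential read of `num_disks` blocks touches every disk once, so the transfers can proceed in parallel; an access method with knowledge of disk locality could choose a different mapping.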
Tertiary storage
Tertiary storage systems will clearly be an integral part of any large
storage hierarchy. Most current data-management systems, however, do
not transparently move data between secondary and tertiary storage.
And while on-demand, transparent data migration is important, for
performance reasons data migration must overlap data processing. To
make this overlap possible, the system must both accept migration
"hints" and provide a mechanism for programmers to learn the location of
their data, especially when writing access methods that might optimize
migration decisions.
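
A toy sketch of the two mechanisms discussed above, hints and location queries, is given below. The `MigratingStore` class and its `hint`, `where`, and `read` methods are hypothetical names, and a background thread copying between two dictionaries stands in for a real transfer from tertiary to secondary storage. The sketch shows only the control pattern: issue a hint for the next data set, then process the current one while the migration proceeds.

```python
from concurrent.futures import ThreadPoolExecutor


class MigratingStore:
    """Toy two-level store: `tertiary` is slow, `secondary` is the staging area."""

    def __init__(self, tertiary):
        self._tertiary = dict(tertiary)
        self._secondary = {}
        self._pending = {}  # name -> Future for an in-flight migration
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _migrate(self, name):
        # Stand-in for a slow tertiary-to-secondary transfer.
        self._secondary[name] = self._tertiary[name]

    def hint(self, name):
        """Advise the store that `name` will be needed soon."""
        if name not in self._secondary and name not in self._pending:
            self._pending[name] = self._pool.submit(self._migrate, name)

    def where(self, name):
        """Report where a data set currently resides."""
        return "secondary" if name in self._secondary else "tertiary"

    def read(self, name):
        future = self._pending.pop(name, None)
        if future is not None:
            future.result()      # wait for the hinted migration to finish
        elif name not in self._secondary:
            self._migrate(name)  # on-demand migration, no overlap
        return self._secondary[name]

    def close(self):
        self._pool.shutdown()


def process_all(store, names, work):
    """Process data sets in order, hinting the next one before each step."""
    results = []
    for i, name in enumerate(names):
        if i + 1 < len(names):
            store.hint(names[i + 1])  # migrate the next set during this step
        results.append(work(store.read(name)))
    return results
```

An access method could call `where` before choosing a processing order, for example to handle data sets already on secondary storage first.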
Networked storage systems
Increasingly fast networks are making it possible to spread large data
sets over geographically distributed sites for processing and
consumption. Nonetheless, striping data transfers across network
interfaces, or across networks themselves, may be required to achieve the
necessary concurrency and redundancy. The efficient management,
naming, caching, and delivery of data over networks will be a
challenge in a heterogeneous distributed environment. Again, hints
can be used to support automatic management, but low-level control
should be possible to support sophisticated programmers.
Research recommendations
It is clear that today's solutions are not adequate for tomorrow's
storage systems. We suggest intense research in the following areas:
- What is the future architectural relationship between processor
nodes and storage devices, for example, within a parallel
supercomputer?
- Will huge RAM capacities change the traditional storage trade-offs?
- Can database techniques effectively replace file-system techniques
for managing the name space of large-scale storage systems?
- If a database represents the data collection,
what should the new end-user "shell" look like?
- Should files be typed? How should access methods be defined, and
associated with files or applications? How should access methods be
made available to, or extended by, the application programmer?
- If access methods run on the storage servers, what are the new
optimization opportunities? What are the risks?
- Is there an appropriate abstraction for locality of disks,
relative to each other and to the processors? Can this abstraction
be used to optimize inter-disk layout and access patterns?
- How should secondary-tertiary data migration be supported?
How can hints be specified? How can low-level control be provided?
- In what ways must a distributed mass storage system
be managed differently than a traditional distributed file or
database system?
- In what ways is it possible to treat distant storage systems as
tertiary storage devices?
Acknowledgements
We are indebted to many ideas from the research literature in forming
our position. Not all of the above ideas are new or our own.
Permission to make digital
or hard copies of part or all of this work for personal or classroom
use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for
components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from
Publications Dept, ACM Inc., fax +1 (212) 869-0481, or
permissions@acm.org.
Last modified: Fri Dec 6 20:09:13 1996
David Kotz