ACM Computing Surveys 28A(4), December 1996, Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.

Strategic Directions in Computing Research

Working Group on Storage I/O for Large-Scale Computing

Large-Scale File Systems with the Flexibility of Databases

Alok Choudhary

Northwestern University, Department of Electrical and Computer Engineering
Technological Institute, 2145 Sheridan Road, Evanston, IL, 60208-3118, USA

David Kotz

Dartmouth College, Department of Computer Science
6211 Sudikoff Laboratory, Hanover, NH, 03755-3510, USA

Abstract: We note that large-scale computing includes many applications with intensive I/O demands. A data-storage system for such applications must address two issues: locating the appropriate data set, and accessing the contents of the data set. Today, there are two extreme models of data location and management: 1) file systems, which can be fast but which require a user to manage the structure of the file-name space and, often, of the file contents; and 2) object-oriented-database systems, in which even the smallest granule of data is stored as an object with associated access methods, which is very flexible but often slow. We propose a solution that may provide the performance of file systems with the flexibility of object databases.

Categories and Subject Descriptors: B.4.3 [Input-Output and Data Communications]: Interconnections - Interfaces; C.2.4 [Computer Communication Networks]: Distributed Systems - Network Operating Systems; D.4.2 [Operating Systems]: Storage Management - Storage Hierarchies; D.4.3 [Operating Systems]: File Systems Management - Distributed File Systems; D.3.3 [Programming Languages]: Language Constructs and Features - Data Types and Structures; E.2 [Data Storage Representation]: Composite Structures; E.5 [Files]: Optimizations; H.2 [Database Management]: Systems.

General Terms: Parallel I/O, File System, Database

Additional Key Words and Phrases: High-Performance I/O, Parallel File System, Runtime Systems, Database Management, Storage Systems, Compilation

1 Applications

Large-scale computing includes many application areas with intensive I/O demands, including scientific computing, data mining, digital libraries, decision support, and data visualization. The data comes in many forms (matrices, finite-element meshes, transaction logs, customer databases, financial records, literature, music, images, video), much of which has a complex structure. There is structure both in the format of each data object (a matrix, for example, or a video clip) and in the collection of objects. All of the large-scale computing applications mentioned above will have to manage tremendous collections of large objects.

2 Architectural assumptions

We assume that large-scale computing environments will include a range of systems from personal computers and workstations to high-performance computers, often geographically distributed, and interconnected with high-speed networks. Consumers of the data and information may not know how the data is stored or managed, or where the data is stored. A distributed set of mass-storage servers will collectively provide the primary storage for large data collections. The supercomputers and workstations will have internal storage systems, probably but not necessarily based on disks attached to all nodes of parallel computers (depending on the prospects for network-attached peripherals). Supercomputers, at least, will have multiple network connections, to avoid a bottleneck at a single interface node.

Memory capacity, at all levels from RAM to tape robots, is increasing exponentially, a trend that is expected to continue for at least another decade. Latency and bandwidth, however, are improving very slowly, thus widening the gap between computation and I/O speeds. We are therefore faced with the challenge of managing ever-larger and ever-slower storage (when compared to high-performance CPUs), for both high performance and ease of use. There are as yet no clear solutions.

3 Database-style metadata management

A data-storage system must address two issues: locating the appropriate data set, and accessing the contents of the data set. Today, there are two extreme models of data location and management: 1) file systems, which can be fast but which require a user to manage the structure of the file-name space and, often, of the file contents; and 2) object-oriented-database systems, in which even the smallest granule of data is stored as an object with associated access methods, which is very flexible but often slow.

We propose a solution that may provide the performance of file systems with the flexibility of object databases.

Locating a data set

We believe that the traditional data-storage abstraction of flat files, using short names in a hierarchical name space, is doomed to failure in future large-scale storage systems. Consider a scientific computing application such as global climate modeling. With a traditional file system, the data from each pass of each satellite may be stored in a single file. There are tens of satellites, several passes per day, hundreds of days per year, and tens of years, leading to thousands of files. There are then many derivative files for each raw-data file. There are also several output files from each run of the climate models.
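As a rough illustration of the scale, the count of raw-data files can be sketched as a simple product. The specific values below are our assumptions, one picked from each loose range in the paragraph above; they are not figures from any real mission.

```python
def raw_file_count(satellites, passes_per_day, days_per_year, years):
    """One file per pass of each satellite, as in the traditional layout."""
    return satellites * passes_per_day * days_per_year * years

# Illustrative assumptions only: "tens", "several", "hundreds", "tens".
count = raw_file_count(satellites=10, passes_per_day=3,
                       days_per_year=300, years=10)
print(count)  # 90000
```

Even with these conservative choices the raw data alone reaches tens of thousands of files, before any derivative or model-output files are counted.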

A database may be more appropriate than a traditional file system to manage such a data collection. The programmer can then use database queries to select the appropriate data set. In essence, we propose to replace the file-name space with a database structure; unlike a traditional database, however, the "records" might be matrices, images, movies, and other complex data objects. The scientific-database community has been proposing this idea for some time; can it be generalized?
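A minimal sketch of this idea, using Python's standard sqlite3 module as a stand-in for the metadata database. The table name, column names, and the `build_catalog` and `locate` helpers are hypothetical and chosen only for illustration; a query returns opaque object handles rather than file names.

```python
import sqlite3

def build_catalog(rows):
    """Create an in-memory catalog; each row describes one stored data object."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE dataset ("
        " object_id TEXT PRIMARY KEY,"  # opaque handle for the stored object
        " satellite TEXT, pass_date TEXT, kind TEXT)"
    )
    db.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)
    return db

def locate(db, satellite, first_day, last_day, kind="raw"):
    """Select data objects by their attributes instead of by path name."""
    cur = db.execute(
        "SELECT object_id FROM dataset"
        " WHERE satellite = ? AND kind = ? AND pass_date BETWEEN ? AND ?"
        " ORDER BY pass_date, object_id",
        (satellite, kind, first_day, last_day),
    )
    return [row[0] for row in cur]

# Hypothetical catalog entries; dates are ISO strings so they sort correctly.
catalog = build_catalog([
    ("obj-001", "sat-A", "1996-01-01", "raw"),
    ("obj-002", "sat-A", "1996-01-02", "raw"),
    ("obj-003", "sat-B", "1996-01-01", "raw"),
    ("obj-004", "sat-A", "1996-01-02", "derived"),
])
january_raw = locate(catalog, "sat-A", "1996-01-01", "1996-01-31")
```

The objects themselves (matrices, images, and so on) would live in the storage system; only their descriptive attributes sit in the catalog.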

Accessing the data set

Given the rich structure and diversity of the stored data objects, programmers need a richer data-access interface than the traditional flat file. Indeed, we expect to see three levels of interface:
  1. For system programmers: a low-level interface for writing datatype-specific and application-specific access methods (see below). This interface exposes some low-level information, such as locality and layout.
  2. For application programmers: a strided read/write/seek interface, plus any datatype-specific or application-specific access methods.
  3. For end-users: generally application specific. A new "shell", for metadata access, allows them to select a data set and pass it on to an application for processing.
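To make the second level concrete, here is one possible shape for a strided read, sketched over any seekable Python file object. The function name and parameters are our assumptions, not an interface defined in this paper.

```python
import io

def strided_read(f, offset, record_size, stride, count):
    """Read up to `count` records of `record_size` bytes each.

    Record i starts at byte `offset + i * stride`. Reading stops early
    if a record would run past the end of the data.
    """
    if record_size <= 0 or stride <= 0 or count < 0:
        raise ValueError("record_size and stride must be positive, count non-negative")
    records = []
    for i in range(count):
        f.seek(offset + i * stride)
        chunk = f.read(record_size)
        if len(chunk) < record_size:
            break  # ran off the end of the data set
        records.append(chunk)
    return records

# A 4x4 matrix of one-byte elements stored row-major: bytes 0..15.
matrix_file = io.BytesIO(bytes(range(16)))
# Column 1 is every fourth byte starting at offset 1.
column = strided_read(matrix_file, offset=1, record_size=1, stride=4, count=4)
```

A single strided request like this describes the whole access pattern at once, which gives the storage system a chance to reorder or batch the underlying transfers.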

Each data type or application may have customized access methods that provide higher-level abstractions to the application programmer, and that can take advantage of low-level information to optimize performance. Type-specific access methods might provide access to irregular structures, provide hidden compression or format conversion, and so forth.
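As a small sketch of a type-specific access method with hidden compression, consider a matrix type whose rows are compressed individually. The class and method names are hypothetical, and the restriction to element values 0 through 255 is an assumption made only to keep the example short.

```python
import zlib

class CompressedMatrix:
    """Hypothetical access method: a matrix stored as per-row zlib blobs.

    Callers see rows and columns; compression is hidden from them.
    Elements must be integers in 0..255 (an assumption of this sketch).
    """

    def __init__(self, rows):
        self._rows = [zlib.compress(bytes(r)) for r in rows]

    def row(self, i):
        # Decompress only the row that was asked for.
        return list(zlib.decompress(self._rows[i]))

    def column(self, j):
        # A column touches every row, so every row is decompressed.
        return [self.row(i)[j] for i in range(len(self._rows))]

m = CompressedMatrix([[1, 2, 3], [4, 5, 6]])
second_row = m.row(1)
last_column = m.column(2)
```

Because the access method knows the layout, it could choose a different one (for example, compressing by column) for workloads dominated by column access, without changing the application's view.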

The code for some access methods may run on the I/O servers, where it can interact with the I/O services and devices in a tightly coupled fashion. This idea puts application- or type-specific code on both ends of the network between the application and the I/O device, leading to better performance.

We have traditionally had an abstraction for disk devices (a linear sequence of sectors) that aids development of layout- and access-optimization algorithms. Given an appropriate abstraction for locality of disks, relative to each other and to the processors, can access methods optimize inter-disk layout (declustering) and access patterns?
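One simple declustering rule an access method might apply to a two-dimensional data set is a modulo placement of blocks on disks. The sketch below is only one illustrative choice among many; the function names are ours, and it ignores disk-to-processor locality entirely.

```python
def disk_for_block(block_row, block_col, num_disks):
    """Place block (row, col) so that consecutive blocks along a row or
    along a column fall on different disks."""
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    return (block_row + block_col) % num_disks

def disks_touched(blocks, num_disks):
    """Set of disks involved in fetching the given (row, col) blocks."""
    return {disk_for_block(r, c, num_disks) for r, c in blocks}

# With 4 disks, a 4-block row and a 4-block column each spread over all disks.
row_disks = disks_touched([(0, c) for c in range(4)], num_disks=4)
col_disks = disks_touched([(r, 2) for r in range(4)], num_disks=4)
```

An access method that knows which pattern dominates (rows, columns, or sub-blocks) could pick a placement that maximizes the number of disks each request touches.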

Tertiary storage

Tertiary storage systems will clearly be an integral part of any large storage hierarchy. Most current data-management systems, however, do not transparently move data between secondary and tertiary storage. And while on-demand, transparent data migration is important, for performance reasons data migration must overlap data processing. To make this overlap possible, the system must both accept migration "hints" and provide a mechanism for programmers to learn about the location of their data, especially when writing access methods that might optimize migration decisions.
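The two requirements, hints and location queries, can be sketched as follows. This is a toy model under stated assumptions: dictionaries stand in for the storage levels, a thread pool stands in for the migration engine, a single application thread issues the calls, and the class and method names (`hint`, `where`, `read`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class MigratingStore:
    """Toy two-level store with migration hints and a location query."""

    def __init__(self, tertiary):
        self._tertiary = dict(tertiary)  # name -> bytes on slow storage
        self._secondary = {}             # name -> bytes staged on fast storage
        self._pending = {}               # name -> Future of in-flight migration
        self._pool = ThreadPoolExecutor(max_workers=2)

    def _stage(self, name):
        data = self._tertiary[name]      # stands in for a slow tape read
        self._secondary[name] = data
        return data

    def hint(self, name):
        """Advise that `name` will be needed soon; staging runs in the background."""
        if name not in self._secondary and name not in self._pending:
            self._pending[name] = self._pool.submit(self._stage, name)

    def where(self, name):
        """Report where the data currently resides."""
        if name in self._secondary:
            return "secondary"
        if name in self._pending:
            return "migrating"
        return "tertiary"

    def read(self, name):
        """Return the data, waiting for or triggering migration if needed."""
        future = self._pending.pop(name, None)
        if future is not None:
            return future.result()       # wait for the hinted migration
        if name in self._secondary:
            return self._secondary[name]
        return self._stage(name)         # on-demand migration, not overlapped

    def close(self):
        self._pool.shutdown(wait=True)

store = MigratingStore({"pass-17": b"raw data", "pass-18": b"more data"})
before = store.where("pass-18")
store.hint("pass-18")                    # migration overlaps the work below
first = store.read("pass-17")            # process the current data set
second = store.read("pass-18")           # likely staged already
after = store.where("pass-18")
store.close()
```

The application issues the hint for the next data set before processing the current one, so that migration and computation proceed concurrently.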

Networked storage systems

Increasingly fast networks are making it possible to spread large data sets over geographically distributed sites for processing and consumption. Nonetheless, striping data transfers across network interfaces, or across networks themselves, may be necessary to achieve the required concurrency and redundancy. The efficient management, naming, caching, and delivery of data over networks will be a challenge in a heterogeneous distributed environment. Again, hints can be used to support automatic management, but low-level control should be possible to support sophisticated programmers.
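Striping a transfer across several links can be sketched as a round-robin split and reassembly. In this illustration in-memory lists stand in for the network links, no redundancy is added, and the function names are our own.

```python
def stripe(data, num_links, chunk_size):
    """Split `data` into chunks and deal them round-robin onto `num_links` lanes."""
    if num_links <= 0 or chunk_size <= 0:
        raise ValueError("num_links and chunk_size must be positive")
    lanes = [[] for _ in range(num_links)]
    for k, start in enumerate(range(0, len(data), chunk_size)):
        lanes[k % num_links].append(data[start:start + chunk_size])
    return lanes

def reassemble(lanes):
    """Invert stripe(): take one chunk from each lane in turn."""
    out = []
    longest = max((len(lane) for lane in lanes), default=0)
    for k in range(longest):
        for lane in lanes:
            if k < len(lane):
                out.append(lane[k])
    return b"".join(out)

lanes = stripe(b"abcdefghij", num_links=3, chunk_size=2)
restored = reassemble(lanes)
```

Each lane could be sent over a separate interface concurrently; adding redundancy would require extra parity or replica chunks, which this sketch omits.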

Research recommendations

It is clear that today's solutions are not adequate for tomorrow's storage systems. We suggest intense research in the following areas:


We are indebted to many ideas from the research literature in forming our position. Not all of the above ideas are new or our own.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481.

Last modified: Fri Dec 6 20:09:13 1996
David Kotz