File Structure in Galley

Most existing multiprocessor file systems are based on the conventional Unix-like file-system interface, in which a file is seen as an addressable, linear sequence of bytes. The file system typically declusters files (i.e., scatters the blocks of each file across multiple disks), allowing parallel access to the file. This parallel access reduces the effect of the bottleneck imposed by relatively slow disk speeds. Although the file is actually scattered across many disks, the underlying parallel structure of the file is hidden from the application.

Galley uses a more complex file model that should lead to greater flexibility and performance.

Subfiles

The linear model can give good performance when the request size generated by the application is larger than the declustering unit size, since such requests use multiple disks in parallel. However, while the declustering unit size is frequently measured in kilobytes (e.g., 4KB in Intel's CFS), our workload characterization studies show that the typical request size in a parallel application is much smaller: frequently under 200 bytes. This disparity means that most of the individual requests generated by parallel applications are not executed in parallel. In the worst case, the processes of a parallel application may issue their requests in such a way that they all attempt to access disk 0 simultaneously, then all attempt to access disk 1 simultaneously, and so on.
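To make the disparity concrete, the sketch below (illustrative only; the disk count is an assumption, and the code is not drawn from any particular file system) maps byte offsets to disks under simple cyclic block declustering. With a 4KB declustering unit, about twenty consecutive 200-byte requests from a single process all fall on the same disk before the next disk is touched.

    /* Minimal sketch of cyclic block declustering (illustrative only). */
    #include <stdio.h>

    #define UNIT   4096   /* declustering unit size, e.g., 4KB as in CFS */
    #define NDISKS 8      /* hypothetical number of disks */

    static int disk_of(long offset)
    {
        return (int)((offset / UNIT) % NDISKS);
    }

    int main(void)
    {
        long offset;
        /* 200-byte sequential requests: the mapping shows that the
         * first twenty or so requests all resolve to disk 0. */
        for (offset = 0; offset < 2 * UNIT; offset += 200)
            printf("offset %6ld -> disk %d\n", offset, disk_of(offset));
        return 0;
    }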

Another problem with the linear file model is that a dataset may have a natural, parallel mapping onto multiple disks that is not easily captured by the standard cyclic block-declustering schemes. An example of this sort of dataset may be seen in the Flexible Image Transport System (FITS) data format, which is used for astronomical data. A FITS file is organized as a series of records, each of which is the same size and contains a key (which may be composed of multiple fields) and one or more data elements. While there is a fairly straightforward mapping of a FITS dataset onto a linear file (simply listing the records), it is not clear in what order the records should be listed. Furthermore, it is not clear that blindly striping this list across multiple disks is the optimal approach in a parallel file system. Rather, one could use the keys, together with knowledge of how the dataset will be used, to choose a partitioning scheme that distributes the data across the disks and allows highly parallel access.
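As a rough illustration of key-based partitioning (the record layout and hash function below are assumptions for the example, not part of FITS or of Galley), records could be assigned to disks by their keys rather than by their position in a linear file:

    /* Hypothetical key-based partitioning of fixed-size records. */
    #include <stddef.h>

    #define NDISKS 8

    struct record {
        char   key[16];      /* key, possibly built from several fields */
        double data[32];     /* one or more data elements */
    };

    /* Choose a disk from the key, so the placement of records reflects
     * how the dataset will be accessed rather than its linear ordering. */
    static int disk_for_record(const struct record *r)
    {
        unsigned h = 0;
        size_t   i;
        for (i = 0; i < sizeof r->key && r->key[i] != '\0'; i++)
            h = h * 31 + (unsigned char)r->key[i];
        return (int)(h % NDISKS);
    }

A real partitioning scheme might instead group records by key range or by expected access pattern; the hash here only stands in for that decision.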

To address these problems, Galley does not automatically decluster an application's data. Instead, Galley provides applications with the ability to fully control this declustering according to their own needs. This control is particularly important when implementing I/O-optimal algorithms. Applications are also able to explicitly indicate which IOP they wish to access in each request. To allow this behavior, files are composed of one or more subfiles, each of which resides entirely on a single IOP and may be directly addressed by the application. This structure gives applications the ability both to control how the data is distributed across the IOPs and to control the degree of parallelism exercised on every subsequent access.
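The fragment below sketches what that control might look like from an application's point of view. The function names and signatures are invented for this example (prototypes only); they are not Galley's actual interface.

    /* Hypothetical subfile interface, in the spirit of the Galley model.
     * The names and signatures are illustrative, not Galley's API. */
    int  pfs_create(const char *name, int nsubfiles);     /* subfile count fixed at creation */
    int  pfs_open(const char *name);
    long pfs_write_subfile(int fd, int subfile, long offset,
                           const void *buf, long nbytes); /* address one IOP explicitly */

    void checkpoint(int fd, int my_iop, const double *local, long n)
    {
        /* Each process writes its portion to the subfile on a chosen IOP,
         * so the distribution of data and the degree of parallelism are
         * decided by the application (or a library), not by the file system. */
        pfs_write_subfile(fd, my_iop, 0, local, n * (long)sizeof(double));
    }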

Forks

Each subfile in Galley is structured as a collection of one or more independent forks. Each fork is a named, linear, addressable sequence of bytes, similar to a traditional Unix file. While the number of subfiles in a file is fixed at file-creation time, the number of forks in a subfile is not; libraries and applications may add forks to, or remove forks from, a subfile at any time. The resulting file structure is three-dimensional: a file is a collection of subfiles, each subfile is a collection of forks, and each fork is a linear sequence of bytes. Note that there is no requirement that all subfiles have the same number of forks, or that all forks have the same size.
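The three-level structure can be pictured as nested types; the declarations below only describe the model and are not Galley's actual metadata structures.

    /* Descriptive types for the three-dimensional file model
     * (illustrative only; not Galley's on-disk or in-memory layout). */
    struct fork {                   /* named, linear sequence of bytes */
        char  name[32];
        long  length;               /* forks may differ in size */
    };

    struct subfile {                /* resides entirely on one IOP */
        int          iop;
        int          nforks;        /* may grow or shrink over time */
        struct fork *forks;
    };

    struct file {
        int             nsubfiles;  /* fixed at file-creation time */
        struct subfile *subfiles;
    };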

The use of forks allows further application-defined structuring. For example, if an application represents a physical space with two matrices, one containing temperatures and the other pressures, the matrices could be stored in the same file (perhaps declustered across multiple subfiles) but in different forks. In this way, related information is stored logically together but is available independently. As a result, an application (e.g., a visualization tool) that is interested in only one of the two pieces of data can access the relevant data easily and efficiently.
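For instance, the simulation could keep the two matrices in forks named "temperature" and "pressure" within each subfile, and the visualization tool would read only the fork it needs. The calls shown are hypothetical (prototypes only), in the same illustrative style as above.

    /* Hypothetical fork calls; names and signatures are illustrative. */
    int  pfs_create_fork(int fd, int subfile, const char *fork_name);
    long pfs_read_fork(int fd, int subfile, const char *fork_name,
                       long offset, void *buf, long nbytes);

    void read_temperatures(int fd, int subfile, double *t, long n)
    {
        /* Touches only the "temperature" fork; the "pressure" fork
         * stored in the same subfile is never read. */
        pfs_read_fork(fd, subfile, "temperature", 0, t,
                      n * (long)sizeof(double));
    }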

In addition to data in the traditional sense, the availability of multiple forks also allows an application to store persistent, application-defined metadata independently of the data proper. One example of a library that could make use of multiple forks for metadata is a compression library similar to the one described in 'compression.ps', once available at http://bunny.cs.uiuc.edu/CADR/pubs/. Such a library could store compressed data chunks in one fork and directory information in another.
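One way such a library might lay out its forks is sketched below; the directory-entry format is an assumption made for this example and is not taken from the cited paper.

    /* Possible layout for a compression library built on forks.
     * The entry format is an illustrative assumption. */
    struct chunk_entry {
        long uncompressed_offset;  /* position of the chunk in the logical data */
        long compressed_offset;    /* position of its bytes in the data fork    */
        long compressed_length;
    };
    /* A "data" fork holds the compressed chunks back to back, while a
     * "directory" fork holds an array of chunk_entry records, so a read can
     * locate the chunk covering a requested offset without scanning the data. */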

Another potential use for forks can be seen in Stream*, a parallel file abstraction for the data-parallel language C*. Briefly, Stream* divides a file into three distinct segments, each of which corresponds to a particular set of access semantics. While the current implementation of Stream* stores all the segments in a single file, one could use a different fork for each segment. In addition to the raw data, Stream* maintains several kinds of metadata, which are currently stored in three separate files: .meta, .first, and .dir. In a Galley-based implementation of Stream*, it would be natural to store this metadata in separate forks rather than in separate files.
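One possible fork layout for such an implementation is sketched below; the fork names are placeholders chosen for the example, not names taken from the Stream* implementation.

    /* A possible fork layout for a Galley-based Stream* subfile
     * (fork names are illustrative placeholders). */
    static const char *streamstar_forks[] = {
        "segment0",   /* first data segment, with its own access semantics */
        "segment1",   /* second data segment                               */
        "segment2",   /* third data segment                                */
        "meta",       /* metadata now kept in the .meta file               */
        "first",      /* metadata now kept in the .first file              */
        "dir"         /* metadata now kept in the .dir file                */
    };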

Finally, a fork could also be used to store the code (Python, Java, standard object code, etc.) necessary to access the data stored in the other forks. This code could be linked with a user's application at runtime, or possibly loaded and run on the I/O node (see our white paper in ACM Operating Systems Review).

