The code we're working on is probably too big to include here (though it isn't restricted), but I can describe what we're doing and the problems we've encountered. I should mention that I/O has been the single biggest obstacle in our efforts, and we're still working on parts of it.
First, we are attempting to write a code that is very portable. The current version runs on both a Cray and a CM-5 with almost no #ifdef's and only a few special subroutines to get around some Cray compiler problems (they still don't really have a good F90 compiler). The same code also supports message passing, although that work isn't quite complete. Anyway, given the need for portability, we'd like an I/O scheme that also works well for all the models we're supporting.
We have not yet truly solved this problem. Currently we have a bunch of kludges that let us work around a horribly written I/O scheme. For example, the current version was written specifically around the Cray and its SSD. Restart files are read into huge buffer arrays, and the code picks out the region of interest. Unfortunately, on the CM-5 the region of interest is the entire domain, so we'd like to bypass the buffer garbage altogether.
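To make the kludge concrete, here is a rough sketch of what the current scheme amounts to; the array names, sizes, and one-field-per-record file layout are all invented for illustration, not our actual code. The whole global field lands in a buffer, and the piece we actually want is copied out afterwards, which on the CM-5 means copying the entire domain.

    ! Rough sketch only: invented names, sizes, and file layout.
    program read_into_buffer
      implicit none
      integer, parameter :: nlon = 128, nlat = 64, nlev = 18
      real :: buffer(nlon,nlat,nlev)   ! huge global buffer array
      real :: field(nlon,nlat,nlev)    ! what the CM-5 version actually wants

      open (unit=10, file='restart.basic', form='unformatted', status='old')
      read (10) buffer                 ! whole global field goes into the buffer
      close (10)
      field = buffer                   ! on the CM-5, a needless full-domain copy
    end program read_into_buffer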
I suspect you're more interested in how we write data to and read it from files. For the atmospheric model, we need to read and write restart files. We have one restart file with the basic fields, one for fixed tracer fields, and one for user tracer fields. For example, the basic fields include all the 3-D arrays for the velocity fields, temperature, and moisture, as well as 2-D arrays for things like surface pressure. Two time levels of each are stored (since we're using a second-order time scheme). These files also contain headers with various time and parameter information.
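For what it's worth, here is a minimal sketch of how one of those restart files might be laid out with plain Fortran unformatted sequential records; the field names, dimensions, and record grouping are made up and not our actual format.

    ! Minimal sketch of a restart-file layout: invented names, sizes, and record grouping.
    program write_restart
      implicit none
      integer, parameter :: nlon = 128, nlat = 64, nlev = 18, ntime = 2
      integer :: step
      real    :: dt
      real    :: u(nlon,nlat,nlev,ntime), v(nlon,nlat,nlev,ntime)   ! velocity fields
      real    :: t(nlon,nlat,nlev,ntime), q(nlon,nlat,nlev,ntime)   ! temperature, moisture
      real    :: ps(nlon,nlat,ntime)                                ! surface pressure

      u = 0.0; v = 0.0; t = 0.0; q = 0.0; ps = 0.0
      step = 1000; dt = 600.0

      open (unit=10, file='restart.basic', form='unformatted', status='replace')
      write (10) step, dt     ! header record: time and parameter information
      write (10) u, v, t, q   ! 3-D fields, both time levels
      write (10) ps           ! 2-D fields, both time levels
      close (10)
    end program write_restart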
Some of the problems we've encountered:
I guess "region" was a bad choice of words. On smaller-memory Crays (such as the YMP at GFDL), computations are typically done one latitude at a time. The global fields are stored either on an SSD or in in-core buffer arrays. The code grabs the data for the latitude of interest, plus a latitude on either side for boundary values, and computes everything for that particular latitude; the actual work is done on arrays with a latitude extent of only three. On the CM-5, we do all the latitudes at once. On message-passing machines, we have added the capability to chunk the data in two directions, so each processor gets some rectangular chunk of data to work on. As I said, we're still working on a better I/O scheme that will let us support all of these models without using big global buffer arrays.
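A rough sketch of the one-latitude-at-a-time scheme, again with invented names and sizes: the global field sits in a big in-core array (standing in for the SSD copy), and a three-latitude window is copied into a small work array before the computation for that latitude is done.

    ! Rough sketch only: invented names and sizes.
    program latitude_window
      implicit none
      integer, parameter :: nlon = 128, nlat = 64, nlev = 18
      real :: global(nlon,nlat,nlev)   ! stands in for the SSD copy / in-core buffer
      real :: work(nlon,3,nlev)        ! latitude extent of only three
      integer :: j

      global = 0.0
      do j = 2, nlat - 1
         work = global(:, j-1:j+1, :)     ! latitude of interest plus one on either side
         ! ... compute everything for latitude j using work(:,2,:) ...
         global(:, j, :) = work(:, 2, :)  ! put the updated latitude back
      end do
    end program latitude_window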
So, if I understand correctly, a restart file represents the state of the model at the end of some timestep, and each file contains a series of scalar parameters along with 2-D and 3-D arrays. And all you do with these files is write them out in their entirety at the end of certain timesteps, and then sometimes read them back in, either in their entirety or for only one region, whenever you need to restart the simulation.
Exactly.
When you need to get only one region (subdomain), what does that look like? A contiguous block from each matrix in the file? Some of the columns (or rows) of a matrix? Or something more complex?
As mentioned above, for the Cray the entire domain is read and stored on the SSD (or even in an in-core buffer array). One latitude at a time (actually three, counting the boundary latitudes) is then copied and worked on. On the CM-5, we always work on the entire array. In a message-passing model, we would read a specific rectangular chunk of data and send it to the appropriate node (or, ideally, have each of the nodes read the correct chunk on its own).
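Here is a hedged sketch of what that might look like for the message-passing case; the names, sizes, decomposition, and file layout are all invented. If the field were written one latitude row per direct-access record, each node could read just the records covering its own rectangular chunk and keep only its longitude range.

    ! Hedged sketch: invented names, sizes, decomposition, and file layout
    ! (one latitude row per direct-access record).
    program read_my_chunk
      implicit none
      integer, parameter :: nlon = 128, nlat = 64
      integer, parameter :: ilo = 1, ihi = 32, jlo = 17, jhi = 32   ! this node's chunk
      real    :: row(nlon)
      real    :: chunk(ihi-ilo+1, jhi-jlo+1)
      integer :: j, lrec

      inquire (iolength=lrec) row   ! record length of one latitude row
      open (unit=10, file='restart.basic', form='unformatted', &
            access='direct', recl=lrec, status='old')
      do j = jlo, jhi
         read (10, rec=j) row                 ! read only the rows this node needs
         chunk(:, j-jlo+1) = row(ilo:ihi)     ! keep only this node's longitudes
      end do
      close (10)
    end program read_my_chunk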
Do you use any files for reading and writing scratch data, during the computation?
Not currently. The only files besides restart files are files containing table information (read only once) and some log files into which we periodically dump diagnostic information in ASCII form. Eventually we may add the capability to dump animation frames. If external I/O were as fast as an SSD, then we might consider out-of-core solutions.
Sounds like one of the other big issues for you has been the I/O interface. I imagine this will take some time to straighten out, but I'll bet that Fortran will have a standardized interface (for better or worse) sooner than the rest of the world.
It looks like standard Fortran unformatted I/O will be supported by TMC in their next compiler release, so we can use the same I/O statements whether we're reading from or writing to a DataVault, an SDA, a workstation disk, or whatever. Ideally the storage medium would be transparent to the user.
-- anonymous 1993