Parallel-I/O Anecdote

Take as a basic assumption that I have an algorithm which needs to do a fair bit of I/O. I want to produce a parallel version of this algorithm, and I'm interested in ideas for handling the I/O. I'm not sure if there's been much done about this -- all the discussions I hear seem to ignore doing any kind of I/O in parallel.

By the way, I'm mainly concerned about MIMD-shared memory machines -- I realize that each class of machines presents different problems and solutions.

To provide fuel for the discussion, here are some of my thoughts:

1) Most machines do not have an I/O channel for each process, so having every process initiate all of the I/O requests it wants will really kill performance -- all processes will end up waiting for the disk. Is this a real problem? In a single-user environment? Multi-user?

2) One solution might be to have one process make all of the I/O calls, storing the data in a large buffer. The other processes would "read" or "write" data from the large buffer. What sort of machines will this work on? How general is it?

3) What about cases where the number of CPUs working on your code is indeterminate at any given time (Cray autotasking, for example)?

more info...

Well, in my group we do a lot of high-quality calculation of the electronic structure of molecules. This means solving an approximation to the Schrodinger equation, and at the level we do them, calculations can easily run several hours on (one processor of) a Cray YMP and may require a gigabyte or more of scratch disk space.

What makes these calculations so intensive is that we are essentially solving a non-linear equation with tens to hundreds of millions of parameters. And in general, we can't expect to store all of these and the necessary intermediates in core, even on a large-memory machine.

The first step in a great variety of these calculations is to solve a more approximate equation known as the Hartree-Fock or Self-Consistent Field problem. Here you're typically determining only hundreds to thousands of parameters. I'll give you a rough description of how it works.

The important quantities are the Fock matrix, density matrix and the two-electron integrals. The dimensions of these are based on the number of "basis functions" chosen (by the user) to model the atomic orbitals of the atoms in the molecule. This is typically O(100) these days, though there have been some calculations as large as O(1000) basis functions. The Fock and density matrices are just rank-2 matrices. The two-electron integrals are a rank-4 tensor.

The two-electron integrals are usually computed in advance, by another program, and stored on disk with their identifying indices. Only non-zero integrals are stored, and they are in no particular order as far as the Hartree-Fock program is concerned. The I/O-intensive step in the procedure is constructing the Fock matrix by contracting the integral tensor with the density matrix. Due to permutational symmetries in the indices, each integral can contribute to several different Fock matrix elements with different prefactors. The computational work, therefore, involves unpacking the indices (they are usually condensed into one or two words via bit manipulations, to save space), determining the permutation symmetry of the particular indices, and accumulating the appropriate contributions into the Fock matrix.
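To make that concrete, here is a rough sketch of the inner loop (in free-form Fortran; the routine and argument names are made up, and only a simplified Coulomb-like contribution is shown -- a real code also adds the exchange terms and adjusts the weights when indices coincide):

    ! Process one record of integrals: unpack the four indices from each
    ! packed word and accumulate contributions into the Fock matrix F using
    ! the density matrix D.  Assumes 1-based indices that fit in 8 bits;
    ! real programs differ in their packing conventions.
    subroutine add_record_to_fock(nval, val, packed, d, f, n)
      implicit none
      integer, intent(in)             :: nval, n
      double precision, intent(in)    :: val(*), d(n, n)
      integer, intent(in)             :: packed(*)
      double precision, intent(inout) :: f(n, n)
      integer :: m, i, j, k, l

      do m = 1, nval
         i = ibits(packed(m), 24, 8)      ! four 8-bit fields in one word
         j = ibits(packed(m), 16, 8)
         k = ibits(packed(m),  8, 8)
         l = ibits(packed(m),  0, 8)
         f(i, j) = f(i, j) + d(k, l) * val(m)
         if (i /= k .or. j /= l) f(k, l) = f(k, l) + d(i, j) * val(m)
      end do
    end subroutine add_record_to_fock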

Once the Fock matrix is constructed, it is diagonalized and a new density matrix is computed, and the next iteration begins. Convergence is attained when the new density matrix is the same as the previous one. The steps I glossed over mainly involve matrix multiplies with various other quantities which are not central to the I/O problem -- they are most easily parallelized by someone (preferably the vendor, from our view) writing an optimized BLAS library for their machine.

The logical way to parallelize this operation is to let each processor read part of the integrals and compute a partial Fock matrix; these partial matrices are then added together before the rest of the procedure takes place.
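A rough sketch of that scheme, using OpenMP directives purely as a stand-in for whatever shared-memory tasking a given machine provides (the unit number, the record layout and the 600-integral record length are assumptions, and add_record_to_fock is the routine sketched above):

    ! Threads share one open sequential integral file; a lock serializes the
    ! reads, each thread accumulates a private partial Fock matrix, and the
    ! partial matrices are summed at the end.
    subroutine build_fock_parallel(iunit, n, d, f)
      implicit none
      integer, intent(in)           :: iunit, n
      double precision, intent(in)  :: d(n, n)
      double precision, intent(out) :: f(n, n)
      integer, parameter :: lrec = 600               ! integrals per record (assumed)
      double precision   :: val(lrec), fpart(n, n)
      integer            :: packed(lrec), nval, ios

      f = 0.0d0
    !$omp parallel private(val, packed, nval, ios, fpart) shared(f)
      fpart = 0.0d0
      do
    !$omp critical (intread)
         read(iunit, iostat=ios) nval, val, packed   ! one record under the lock
    !$omp end critical (intread)
         if (ios /= 0 .or. nval <= 0) exit           ! end-of-data record or EOF
         call add_record_to_fock(nval, val, packed, d, fpart, n)
      end do
    !$omp critical (focksum)
      f = f + fpart                                  ! combine the partial matrices
    !$omp end critical (focksum)
    !$omp end parallel
    end subroutine build_fock_parallel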

I should complicate the discussion at this point by mentioning that people have other ways of processing the integrals -- developed long before anyone ever conceived of parallel machines. One is to insert a step where the two-electron integrals are sorted and are formed into certain combinations that eliminate the permutational symmetry concerns later. This is typically a disk-to-disk (rather than disk-to-memory) procedure. It makes the Fock matrix easier to construct computationally, but the I/O works exactly the same way, so *you* have the same problems.

The other way, which has become possible in the last 10-15 years, is that the integral programs have become efficient enough that you can actually afford to *recompute* the integrals every iteration without *too much* extra expense. What you're doing is trading I/O time (you don't write the integrals to disk anymore, you just use them as you compute them) for CPU time. It works because the integral programs are now fast enough that recomputing is not much more expensive than the I/O it replaces.

*My* views on those two ideas are the following: In the first case, you're using twice as much disk space, and for parallelizing things you have the same problems several times over. Since I want to do *large* calculations, I want to conserve disk space as much as possible, and the benefits of the sorting are marginal at best, so I don't want to do it. I think the second case is really the way to go, because you can also parallelize the integral evaluation in various ways. The problem here is that there are few research groups around who are really experts on integral evaluation, so it would take *me* a long time to parallelize such a code myself; alternatively, I could get it from one of the experts, *if* they do it and *if* they are willing to make it available to me. So for the time being, I'll stick to the more traditional method and do what I can.

When you get fancy with the Hartree-Fock procedure, there are several other steps where you might want some kind of parallel I/O. For example, it is useful to apply conjugate-gradient-based algorithms to speed up the convergence of the iterative sequence of density matrices. To do this, you typically need some number of the previous density or Fock matrices, which you then form into a linear combination to produce an improved Fock or density matrix. If you can't store them in memory, then you have to get them from disk -- more I/O.

So that's the basic idea. The real research in my group comes in the higher-level calculations which improve the Hartree-Fock result. These are the configuration interaction (CI) and coupled-cluster (CC) methods, although there are some others too. We do CC methods.

The basic problem in CC methods is very similar to the HF case. We typically need to transform the integral tensor to a new basis (obtained from the HF calculation), which is another disk-to-disk sort procedure with lots of matrix multiplies thrown in. You read in the original integrals and do the first "half transformation" on them -- conceptually two matrix multiplies, and write the results to an intermediate file. You then re-sort the intermediate file so that you have the half-transformed integrals in an order appropriate for the next step. Then the second "half transform" proceeds much like the first, two more matrix multiplies. Now typically, you don't have enough memory to do a complete step at once, so it often requires multiple passes through the files. Remember, we're talking about tens of millions of elements here.
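For one fixed index pair (k,l), the "two matrix multiplies" of a half transformation look roughly like this (names are illustrative; a real code works record-by-record and never holds the full rank-4 tensor in memory):

    ! One half transformation for a fixed (k,l): (pq|kl) = C^T * (ij|kl) * C,
    ! where C is the matrix of HF (MO) coefficients.
    subroutine half_transform(n, c, ao_kl, mo_kl)
      implicit none
      integer, intent(in)           :: n
      double precision, intent(in)  :: c(n, n)      ! HF coefficient matrix
      double precision, intent(in)  :: ao_kl(n, n)  ! (ij|kl) for fixed k,l
      double precision, intent(out) :: mo_kl(n, n)  ! (pq|kl) after the half transform
      mo_kl = matmul(transpose(c), matmul(ao_kl, c))
    end subroutine half_transform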

Having now the transformed integrals, we then solve a much larger non-linear equation. The quantities we seek here are rank-2, rank-4, and sometimes rank-6 and rank-8 (depending on the accuracy we require). Once again, these and the related intermediates are generally stored on disk, and the equation is solved iteratively, having to read and write many of these every cycle.

The code we presently use takes the transformed integrals and writes them (using special routines) into a series of large random access files where they are grouped according to their indices and sometimes for computational convenience identical integrals will be stored in different orders. The routines that handle these files have been written to allow for asynchronous I/O of these "lists" on machines where this is possible. That helps a great deal. Also, the lists are stored with a skip factor put into the low-order byte of each integral, so indices don't need to be stored (they're all in known orders) and sparsity is taken advantage of. These lists can be quite sparse depending on the circumstances.

We're working on developing a new code for these calculations where we are using a much simpler file interface (the old code is unmaintainable) and structuring things so that spatial symmetry of the molecule (which shows up in the integrals and the other quantities we seek) is used, and we end up storing many more lists which are generally pretty dense.

I hope this gives you some idea of what we do. Please ask questions to get me pointed in the right direction if you want more/different info.

I should note that while there are many codes around that do these sorts of calculations, most are large (100k lines or more), complex, and old. They don't consider parallelism and aren't likely to if it requires any significant thought -- the way parallelism is presently done, it usually requires quite a bit of recoding. Few people in this field are really interested in taking advantage of parallel computers -- they have their own uniprocessor of one sort or another, and they perhaps use some supercomputers, but would much rather put their efforts into developing new methods than rewriting programs.

I guess that I am, in a sense, one of the exceptions to this idea. Although we don't have a parallel system here, we do have a network of Sun workstations, which I would like to use one day as a MIMD-local memory processor, and I have access to some other parallel machines for *development* purposes.

My interest at the moment stems from the fact that we are presently re-writing our code from scratch in an attempt to modernize it and incorporate some new theoretical developments we've made in the CC method which require a different structure to the algorithm than our old code used. I, of course, want to incorporate parallelism now so that we can take advantage of it later -- it is a chore to rewrite these codes from scratch, so it doesn't get done very often.

followup

Let me see if I understand each scenario where I/O is involved.

  1. writing integrals to disk, from an integral-generating program.
  2. contracting integrals with the density matrix, to make a new Fock matrix, iteratively.
  3. sorting the integrals in a disk-to-disk (external) sort.
  4. storing and retrieving previous matrices in the conjugate-gradient method.
  5. transform integrals to intermediate file, then sort that, then transform that to final file; conceptually: transform1 -> sort -> transform2 -> final.

answer

Yes, you've got the primary steps. After the integral transformation, your step 5, the CC calculation has many steps which are analogous to your 2 and occasionally your 4, but typically with "longer" pieces of data -- i.e. higher-rank objects involved.

What is the stored size of an "integral"? Fixed or variable size? How many are logically read at a time (ignore buffering for efficiency)? How many integrals are we talking about?

The integral value is usually double precision -- ~15 decimal digits. The four indices can each span the range from 1 to the number of basis functions. Most programs limit themselves to 256 basis functions and pack the four indices (they're integers, by the way) into a single integer using bit shifting. On something like a Cray, with its long integers, you could allow up to 65535 basis functions and still fit them in.
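The packing itself is just bit shifting, something like this (assuming 8-bit fields, so indices up to 255 stored directly; the 256-function programs store i-1, and the field widths and ordering vary from program to program):

    ! Pack four small indices into one default integer.
    integer function pack4(i, j, k, l)
      implicit none
      integer, intent(in) :: i, j, k, l
      pack4 = ior(ishft(i, 24), ior(ishft(j, 16), ior(ishft(k, 8), l)))
    end function pack4

    ! Recover the indices; IBITS extracts an unsigned bit field.
    subroutine unpack4(p, i, j, k, l)
      implicit none
      integer, intent(in)  :: p
      integer, intent(out) :: i, j, k, l
      i = ibits(p, 24, 8)
      j = ibits(p, 16, 8)
      k = ibits(p,  8, 8)
      l = ibits(p,  0, 8)
    end subroutine unpack4

On a 64-bit word you would use 16-bit fields instead, which is where the 65535 figure comes from.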

Integral programs usually write out a series of fixed-size records, consisting of a block of integral values, the associated block of indices, and an integer to indicate how many valid data items are in the record (generally it will fill all records but the last, then it writes a dummy record whose count is an end-of-data flag). The size of the records varies with the program used. In one case, there are 600 integrals per record -- in another, 2000. I suspect these figures stem historically from attempts to fill disk sectors or operating system file buffers most efficiently on whatever machine the program was developed on.
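From the writing side, the layout is roughly the following (the 600-integral record length and the end-of-data flag value are just for illustration):

    ! Write fixed-size records of (count, values, packed indices), padding the
    ! last data record and finishing with a dummy end-of-data record.
    subroutine write_integral_file(iunit, ntot, val, packed)
      implicit none
      integer, parameter           :: lrec = 600   ! integrals per record (assumed)
      integer, parameter           :: iend = -1    ! end-of-data flag (assumed)
      integer, intent(in)          :: iunit, ntot
      double precision, intent(in) :: val(ntot)
      integer, intent(in)          :: packed(ntot)
      double precision :: vbuf(lrec)
      integer          :: ibuf(lrec), first, nval

      first = 1
      do while (first <= ntot)
         nval = min(lrec, ntot - first + 1)
         vbuf = 0.0d0
         ibuf = 0
         vbuf(1:nval) = val(first:first+nval-1)
         ibuf(1:nval) = packed(first:first+nval-1)
         write(iunit) nval, vbuf, ibuf             ! only the first NVAL entries are valid
         first = first + nval
      end do
      vbuf = 0.0d0
      ibuf = 0
      write(iunit) iend, vbuf, ibuf                ! dummy terminator record
    end subroutine write_integral_file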

Although it's certainly possible, we (as consumers of these programs) generally don't fiddle with the index packing or record size too much.

Do you read the integrals file sequentially? If not, in what order? Does it matter in what order the integrals are read? You mention the integrals are stored in lists; it sounds like a list must be read sequentially, and from the beginning, in order to interpret it correctly. True? Is each integral used only once in any pass? That is, if several processes share the work, would they be reading disjoint sets, or would they jointly read some or all of the file?

There are two cases involved here. First the Hartree-Fock procedure...

We read the integrals sequentially. The integral program generates them in an order computationally convenient for it, but not necessarily for the Hartree-Fock procedure. The sort step I mentioned, which some people do between the integral generation and the Hartree-Fock, is an attempt to put things in a better order for Hartree-Fock, but with the sparsity and the way they store the result, it really doesn't help them much.

Assuming you can hold your Fock and density matrices in core (which every program I'm familiar with assumes), it always takes exactly one pass through the integrals to make the Fock matrix. We read them as they come and use program logic to get them to the right place. This logic can even be vectorized on certain machines, like a Cray, which have the CVMGT function. On most, however, I don't think it can. I'm looking into this.

If several processes were to share the work, it would be simple to set it up so that each process makes a partial Fock matrix with the integrals it reads, then all are added together at the end. It doesn't matter who reads what integrals, just so long as they are all read by someone. So you could have each process read the next record, and I guess you'd have to put a lock on it while you're doing so. This is essentially what I proposed in my original note (gee, that was a lot of typing ago). Another way would be to scan the integral file once through to find out how many records there are (in Fortran it is possible to "skip over" a record without having to transfer all of its data, so this can be done pretty rapidly) and have each process seek a certain number of records into the file before beginning to read.
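The record-skipping trick is just a READ with an empty I/O list, so the "seek into the file" version might look like this (hypothetical routine name):

    ! Advance NSKIP whole records on a sequential unformatted file without
    ! transferring any data, so a process can start reading at its own share.
    subroutine skip_records(iunit, nskip)
      implicit none
      integer, intent(in) :: iunit, nskip
      integer :: i, ios
      do i = 1, nskip
         read(iunit, iostat=ios)        ! skip one record
         if (ios /= 0) exit             ! fewer records than expected
      end do
    end subroutine skip_records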

Now for the CC procedure...

Here, the integral handling is a little bit different. The historical development of our code is that the transformed integrals are used in the form of ordered lists. There are different lists corresponding to different "types" of integrals. This was done to a large extent to simplify the programming of the equations themselves later on. It has proved a very useful technique and has allowed for rapid implementation of new equations. The key is that *most* equations (terms) want a given type of integral in a given order. In some cases, for computational efficiency, the same list is stored in two (or more) different orders -- at the cost of some disk space.

The lists are stored in random access files with enough "directory" information to be able to identify a given matrix within the tensor that is stored. The lists are read a *matrix* at a time (this is a rank-2 component of the rank-4 or rank-6 quantity that makes up a "list"). The order is *usually* but not necessarily sequential within a list. With redundant storage of lists, it could always be sequential, but you have to trade off disk storage and efficiency.

As I mentioned, there is a whole (unmaintainable) set of routines which handle these lists. Logically, all lists are stored on a single random access file. Each list has a different length, but the routines maintain a directory of the necessary info so that any matrix of any list can be called up at will. The code takes advantage of asynchronous I/O if it's available and the programmer tells the routines what matrices will be needed. It is also possible to split the logical file across multiple disk drives so that some parallelism is achieved if you have multiple controllers. Then, consecutive matrices of each list are distributed among the different disks.
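The bookkeeping behind that directory is conceptually simple; something along these lines (the type and field names are made up, not the real routines):

    ! Enough information per list to locate any matrix of any list by record
    ! number in the single logical direct-access file.
    module list_directory
      implicit none
      type :: list_entry
         integer :: first_rec      ! first record of this list in the logical file
         integer :: nrow, ncol     ! dimensions of one matrix of the list
         integer :: nmat           ! number of matrices in the list
         integer :: recs_per_mat   ! direct-access records occupied by one matrix
      end type list_entry
    contains
      integer function matrix_record(e, imat)
         type(list_entry), intent(in) :: e
         integer, intent(in)          :: imat
         matrix_record = e%first_rec + (imat - 1) * e%recs_per_mat
      end function matrix_record
    end module list_directory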

Although this probably sounds pretty nice, the fact is that it is quite complicated to use and incurs quite a bit of overhead. Recall that it's all done in Fortran. In the rewrite of the CC program, we're probably going to abandon this complex I/O handling and use plain old random access files.

I haven't given much thought yet to parallelizing the CC procedure. There are obvious ways at the loop level (matrix operations, etc.) but those probably won't do enough for you. And you could conceivably parallelize by having different processes do different terms, but that would give pretty poor load balancing.

What disk access pattern would be presented by the external sorts?

The sort method that everyone in this business uses is due to M. Yoshimine and is published in an IBM Tech Report which I've never seen. Personally, I think it's really just part of the "oral tradition" of the field, but I'll try to find you the reference if you want. Basically, you're going from unordered input to an ordered rank-4 output by sorting on pairs of indices. In the first step, the integrals come in in random order. You look at the first pair of indices to determine which "distribution" an integral belongs to, then at the second pair to determine the "offset" into that distribution -- now you know exactly where the integral *should* go in the final output, you just have to get it there. So you have in memory a buffer for each distribution. You put the integral value and the computed offset into the buffer (just fill them up sequentially). When any buffer is full, it is written onto a direct access file. Eventually, all integrals are processed and you flush all of the partially full buffers to disk.

Also, each time you wrote a buffer out to disk, you stored its record number at the beginning of the next buffer-full. At the end, after you flush the partial buffers, you write (in the very last record of the file) the list of record numbers for the final record of each distribution.

The second step reads in that list and starts with the first distribution. You read in the final record of that distribution. You place each integral into a large array at its "offset". Follow the "chain" saved at the beginning of each record to get all of the records of the first distribution; then you've got the whole first distribution in memory and can write it out in whatever records you like. Now do the same for each subsequent distribution.
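In outline, the two phases look something like this (the number of distributions, the bucket record length, the chaining convention and the next_integral helper are all illustrative; ITMP must be opened for direct access with a record length matching one bucket record):

    module yoshimine_sketch
      implicit none
      integer, parameter :: ndist = 64    ! number of distributions (assumed)
      integer, parameter :: lbuf  = 512   ! (value, offset) pairs per bucket record
    contains

      ! Phase 1: drop each integral into its distribution's bucket; when a
      ! bucket fills, write it to the temp file with a back-link to the
      ! previous record of the same distribution.
      subroutine sort_phase1(iin, itmp, chain_tail)
        integer, intent(in)  :: iin, itmp
        integer, intent(out) :: chain_tail(ndist)   ! last record of each chain
        double precision :: vbuf(lbuf, ndist), val
        integer :: obuf(lbuf, ndist), nfill(ndist), prev(ndist)
        integer :: nextrec, d, off, ios

        vbuf = 0.0d0;  obuf = 0
        nfill = 0;  prev = 0;  chain_tail = 0;  nextrec = 1
        do
           ! next_integral is a hypothetical helper: it returns one integral
           ! value with its distribution number D (from the first index pair)
           ! and its offset OFF (from the second pair); ios /= 0 at end.
           call next_integral(iin, val, d, off, ios)
           if (ios /= 0) exit
           nfill(d) = nfill(d) + 1
           vbuf(nfill(d), d) = val
           obuf(nfill(d), d) = off
           if (nfill(d) == lbuf) then
              write(itmp, rec=nextrec) prev(d), nfill(d), vbuf(:, d), obuf(:, d)
              prev(d) = nextrec
              nextrec = nextrec + 1
              nfill(d) = 0
           end if
        end do
        do d = 1, ndist                   ! flush the partially filled buckets
           if (nfill(d) > 0) then
              write(itmp, rec=nextrec) prev(d), nfill(d), vbuf(:, d), obuf(:, d)
              prev(d) = nextrec
              nextrec = nextrec + 1
           end if
           chain_tail(d) = prev(d)
        end do
      end subroutine sort_phase1

      ! Phase 2: walk each chain backwards through the temp file, scattering
      ! the values to their final offsets, then write the ordered distribution.
      subroutine sort_phase2(itmp, iout, chain_tail, ldist)
        integer, intent(in) :: itmp, iout, chain_tail(ndist), ldist
        double precision :: dist(ldist), vbuf(lbuf)
        integer :: obuf(lbuf), d, irec, prev, n, m

        do d = 1, ndist
           dist = 0.0d0
           irec = chain_tail(d)
           do while (irec > 0)
              read(itmp, rec=irec) prev, n, vbuf, obuf
              do m = 1, n
                 dist(obuf(m)) = vbuf(m)
              end do
              irec = prev
           end do
           write(iout) dist               ! one big record per distribution, for simplicity
        end do
      end subroutine sort_phase2

    end module yoshimine_sketch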

So reading the integrals is just sequential. The intermediate step is very non-sequential, and requires a lot of backwards seeking in the file in the way I just presented it. The final output is pretty much sequential unless you have something peculiar in mind.

I should also note that in the real world (where there are still VAXen with 4MB core) you can't always fit an entire distribution in memory for the second part of the sort. This means several passes for each distribution. Memory problems in the first part force you to make a choice between the size of buffers you maintain for each distribution and the number of distributions you hold in core at once, so you may have to do several passes through the input file if you don't want unreasonably small buffers.

Since the second step involves only unique integrals which end up in unique array elements, you could easily parallelize it by letting each process read a record and "put away" its integrals -- there will never be a conflict for any array element. The first part is a little tougher, I think, because you'd have to be careful to put locks on the top-of-heap pointers for each buffer so that processes don't try to add to the same element. This would probably mean a lot of contention for the pointer, but it really just depends on the randomness of the input integrals -- I'd have to look at that.

How would the matrices be stored in the C-G method? In what order (sequentially, etc), and in what sizes (elements, variable-length rows, fixed-length rows) are these files read and written?

The way most people do this, they have a sequential or direct access file (possible since the matrices are fixed in size by the basis set chosen) and just add each new matrix onto the end, then retrieve the most recent five (for example -- not many are required) to form the new guess. Since so few are used, I intend to produce a simple "circular buffer" using direct access files, to keep only as many as I need. Eventually, I may want to get fancy and use some data compression on some of the matrices, since in the beginning only a few digits are significant. As you near convergence, the "significance" of the matrices increases, so you couldn't compress them as much, or eventually at all. I have some papers from a guy who does this, but I have not looked at them in detail, so I can't say how I'd handle the files exactly.
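The circular buffer itself is trivial with a direct-access file (NKEEP, the unit and the record layout are assumptions; the file would be opened with a record length holding one n-by-n matrix):

    ! Keep only the last NKEEP matrices, overwriting the oldest slot each
    ! iteration; record number = mod(iteration - 1, NKEEP) + 1.
    subroutine save_matrix(iunit, iter, nkeep, n, a)
      implicit none
      integer, intent(in)          :: iunit, iter, nkeep, n
      double precision, intent(in) :: a(n, n)
      write(iunit, rec=mod(iter - 1, nkeep) + 1) a
    end subroutine save_matrix

    subroutine load_matrix(iunit, iter, nkeep, n, a)
      implicit none
      integer, intent(in)           :: iunit, iter, nkeep, n
      double precision, intent(out) :: a(n, n)
      read(iunit, rec=mod(iter - 1, nkeep) + 1) a   ! ITER must be one of the last NKEEP
    end subroutine load_matrix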

Why are there multiple passes in the CC method? Does each pass read the files the same way?

The multiple passes arise because the same integrals or intermediates are often required in several different terms each iteration. As I said above, most often the list will be read in sequence, but it depends on the term being computed.

Is any file updated, or are they all strictly written or strictly read?

In the CC procedure, many quantities are updated -- in fact, the only things that remain fixed are the transformed integrals themselves. It's an iterative procedure.

Are the I/O-related parts of the code I/O-bound, or is there enough CPU activity to balance the I/O time? In other words, what is the CPU time vs. I/O time ratio like?

I don't have a good feel for this at the moment because the code is not well instrumented. I'd guess that in the Hartree-Fock, you're I/O bound and in the CC there is enough computation to alleviate this. I may be wrong about the CC, though.

- You need on the order of 1GB of scratch disk space. That is, LOTS.

Not all calculations are this "large", but on the high end of CC calculations, this is a good bound.

- One program generates the integrals that will be used in your calculation. This file stores 12 bytes per integral (8-byte value, 4-byte indices), with the integrals grouped into blocks of 600 or 2000, for system reasons. This means 7K to 24K blocksize. The order is not significant.

Yes

- Your program reads the entire integral file, sequentially, once per iteration. The order is not significant, so in a parallel version the processes could read them in any order so long as they were all read exactly once. You mention reading them one by one (presumably one record at a time) in arbitrary order, according to some self-scheduled mechanism (based on an atomic counter). Alternatively, find the file size and divide the file up into sections.

Yes

- As the integrals are read, the program builds the new iterate. In parallel, each process would make its own matrix, then they would be added to make the new iterate.

This is the "traditional" way for those in the field who work on parallelization. I don't know if it is the best or the only way, but *I* haven't thought of a better one yet.

- To speed convergence of the basic method, conjugate-gradient methods may be used. This means storing the last few iterate matrices. They are fixed size. One way is to write the new matrix to the end of the file, then read the last few matrices. Or, arrange the file as a circular queue, write to the end of the queue, and read the last few matrices. Sequential write of one matrix, sequential read of several contiguous (except for a possible seek to the beginning of the file) matrices.

Yes, I think the circular queue is really much better, but I know of no one who uses it.

- For other variants of the basic method, the integrals are sorted first. The sort works in two phases: 1) Sequentially read the integrals from the start file, dropping them into buckets based on the indices. Buckets are linked chains of blocks. As a block is filled for a bucket, it is written to a temp file, and the memory can be reused. Thus only one active block for each bucket. Blocks kept linked in temp file, and at the end is a directory of head pointers for each bucket. 2) Read each bucket from temp file, which means a random-access scan of the file. Sort the bucket in memory, ideally. Then write the bucket to the final file. If not enough memory, multiple passes are used to sort the bucket.

You say it much more succinctly than I do.

- For the CC method, a transformation of the integrals is mixed in to the above sort. This just means there is more computation involved with each sort pass. This is only done once [CORRECT?]. The output is in a different form using lists of integrals. These lists must be read in order, and this is usually sequential. The lists vary in length. A directory helps to find the start of each list.

This is all correct. The transformation is done only once.

- In CC method, the transformed-integrals file is read for each iteration, but by lists, rather than sequentially. This means a seek to the beginning of the list and a sequential read of the list. A list may be read more than once per iteration.

Yes

- In CC method they store many intermediate values on disk. In each iteration they sequentially read a lot of values, and sequentially write a lot of values. Not randomly updating a file. [CORRECT?]

Yes. This stems from the fact that in order to compute the terms, you're reading in ordered integrals and intermediates -- so the result will likewise be in some order.

I should probably also mention that in the CC method, when we're computing terms, the disk access is generally the following:

	Read a matrix of integrals or intermediates
	Compute the result
	Write the result matrix
	Next matrix
So in going through a (rank-4, remember) list you pick up only a matrix at a time and compute and write the result -- you don't act on the whole list at once. This probably changes your idea of the disk access patterns somewhat.

I submit that many programmers can tell you what the next bit of data will be. Is it reasonable to *allow* (not require) the programmer to indicate this with some kind of call, like (in Fortran)

	Call PFetch(UNIT [,RECORD] [,REWIND])
Where the RECORD argument would be valid for direct access files; otherwise it would just work sequentially, and the REWIND could tell it that you were about to do something totally out of sequence, so it wouldn't bother with the prefetch. This is similar to what you do in typical asynchronous I/O implementations, but here you're giving hints instead of directly initiating the I/O. The difference is that *I* don't have to worry about the horrors of programming up a double or triple buffer for the I/O to be done. I'm assuming here that if the programmer doesn't make calls to PFetch, the OS is going to do the best it can anyway. Then PFetch calls would only be necessary when you do something "unusual".
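As a sketch of the interface (the body is just a stub that would hand the hint to whatever prefetching machinery exists; the argument types are my assumption):

    subroutine pfetch(unit, record, rewind)
      implicit none
      integer, intent(in)           :: unit
      integer, intent(in), optional :: record   ! next record, for direct access files
      logical, intent(in), optional :: rewind   ! "about to do something out of sequence"
      integer :: next

      if (present(rewind)) then
         if (rewind) return                     ! don't bother prefetching
      end if
      if (present(record)) then
         next = record                          ! prefetch this record of UNIT
      else
         next = -1                              ! prefetch the next sequential record
      end if
      ! ... pass (unit, next) to the system's prefetch/asynchronous-read facility ...
    end subroutine pfetch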

Of course, as a programmer, I now have to say that I much prefer everything be standard, but I suppose that's too much to ask at this point in this area. If you do intend to provide user-level control or assistance in any way similar to my idea above, I think the foremost thing should be that it be simple to retrofit into existing codes.

-- anonymous 1990