\documentclass[11pt]{article} \usepackage[mtbold]{mathtime} \usepackage{graphics} \usepackage{psfrag} \input{thcmac2e} \input{page} \newcommand{\D}{{\cal D}} \newcommand{\Proc}{{\cal P}} \newcommand{\bits}[3]{#1_{#2\range#3}} \newcommand{\zceil}[1]{\lceil#1\rceil} % ceiling without autosized delims \makeatletter \renewcommand\section{\@startsection {section}{1}{\z@}% % {-3.5ex \@plus -1ex \@minus -.2ex}% % {2.3ex \@plus.2ex}% {-2.5ex \@plus -1ex \@minus -.2ex}% {1.5ex \@plus.2ex}% {\normalfont\Large\bfseries}} \renewcommand\subsection{\@startsection{subsection}{2}{\z@}% % {-3.25ex\@plus -1ex \@minus -.2ex}% % {1.5ex \@plus .2ex}% {-2.5ex\@plus -1ex \@minus -.2ex}% {1ex \@plus .2ex}% {\normalfont\large\bfseries}} \renewcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}% % {-3.25ex\@plus -1ex \@minus -.2ex}% % {1.5ex \@plus .2ex}% {-2.5ex\@plus -1ex \@minus -.2ex}% {1ex \@plus .2ex}% {\normalfont\normalsize\bfseries}} \def\@listi{\leftmargin\leftmargini \parsep 2.5\p@ \@plus2\p@ \@minus\p@ \topsep 4\p@ \@plus3\p@ \@minus5\p@ \itemsep2.5\p@ \@plus2\p@ \@minus\p@} \let\@listI\@listi \@listi \makeatother \renewcommand{\baselinestretch}{.95} \normalsize \begin{document} \bibliographystyle{alpha} \thispagestyle{empty} \pagestyle{empty} \title{\vspace*{-40pt}% Relaxing the Problem-Size Bound for Out-of-Core Columnsort} \author{Geeta Chaudhry \\ Elizabeth A. Hamon \\ Thomas H. Cormen \\[1ex] Dartmouth College Department of Computer Science \\ \{geetac, hamon, thc\}@cs.dartmouth.edu} \def\thefootnote{\mbox{}} \footnotetext{This research was supported in part by NSF Grant EIA-98-02068.} \makeatletter \def\thefootnote{\@arabic\c@footnote} \makeatother \date{} \maketitle \begin{abstract} Previous implementations of out-of-core columnsort limit the problem size to $N \leq \sqrt{(M/P)^3 / 2}$, where $N$ is the number of records to sort, $P$ is the number of processors, and $M$ is the total number of records that the entire system can hold in its memory (so that $M/P$ is the number of records that a single processor can hold in its memory). We implemented two variations to out-of-core columnsort that relax this restriction. Subblock columnsort is based on an algorithmic modification of the underlying columnsort algorithm, and it improves the problem-size bound to $N \leq (M/P)^{5/3} / 4^{2/3}$ but at the cost of additional disk I/O\@. $M$-columnsort changes the notion of the column size in columnsort, improving the maximum problem size to $N \leq \sqrt{M^3 / 2}$ but at the cost of additional computation and communication. Experimental results on a Beowulf cluster show that both subblock columnsort and $M$-columnsort run well but that $M$-columnsort is faster. A further advantage of $M$-columnsort is that it handles a wider range of problem sizes than subblock columnsort. \end{abstract} \section{Introduction} \label{sec:intro} Sorting very large data sets is a key subroutine in many applications. For some applications, the amount of data exceeds the capacity of main memory (we call these ``out-of-core'' problems), and the data then typically reside on one or more disks. For example, geographical information systems, seismic modeling programs, and web-search engines store and search through enormous amounts of data. In earlier papers \cite{ChaudhryCoWi01,ChaudhryCo02}, the authors have reported on various programs that sort out-of-core data on distributed-memory clusters. All of these programs are based on Leighton's 8-step columnsort algorithm \cite{Leighton85}. In our adaptation of the columnsort algorithm to an out-of-core setting, we end up with a restriction on the maximum problem size that we can sort. We shall refer to this restriction as the \defn{problem-size restriction}. The maximum number $N$ of records\footnote{Each record contains a \defn{key} according to which the records are to be sorted.} that we can sort on a cluster with $P$ processors and $M$ records of memory overall is \begin{equation} N \leq \sqrt{(M/P)^3 / 2} \ . \label{eq:N-orig-restr} \end{equation} Here, $M/P$ is the number of records that a single processor can hold in its memory.\footnote{In reality, $M/P$ is smaller than the actual size of the physical memory of a processor since we need some auxiliary buffers for in-core computation and interprocessor communication.} There are two sources of this restriction: \begin{enumerate} \item \defn{The height restriction.} Columnsort sorts $N$ values in an $r \times s$ matrix, subject to some restrictions. One of the restrictions is $r \geq 2s^2$, so that the matrix is tall and thin.\footnote{Leighton's original paper \cite{Leighton85} has the restriction $r \geq 2(s-1)^2$. We choose to ignore the low-order terms and use the simpler and more stringent $r \geq 2s^2$.} \item \defn{The height interpretation.} In our prior implementations of columnsort, we require each column of the $r \times s$ matrix to fit in the internal memory of a processor. Setting $r$ to be $M/P$, setting $s$ to be $N/r$ and substituting these values of $r$ and $s$ into the height restriction gives us the problem-size restriction~\eqnref{eq:N-orig-restr}. \end{enumerate} In the present paper, we explore two approaches to relax the problem-size restriction. These two approaches stem from attacking separately the height restriction and the height interpretation. \begin{description} \item[Subblock columnsort:] We relax the height restriction by adding two new steps to columnsort. The resulting algorithm, which we call \defn{subblock columnsort}, relaxes the height restriction by a factor of $\sqrt{s}/2$, to $r \geq 4s^{3/2}$. With the same height interpretation of $r = M/P$, we get a problem-size restriction of \begin{equation} N \leq (M/P)^{5/3} / 4^{2/3} \ . \label{eq:N-rev-restr} \end{equation} This improvement in problem size can be quite substantial in an out-of-core setting. For most current systems ($M/P \geq 2^{12}$ records), this change will enable us to more than double the largest problem size. % if M/P is => 2^12, so, M/P^5/3 => r^20 / 4 = r^18. and % r^3/2 = r^18 => r^3/2 / 2 = r^17. therefore, we more than double. % generally r^5/3 / 4 > 2r^3/2 / 2 <=> % let r=M/P = 2^k. so % 2^(5k/3 - 2) > 2^(3k/2) <=> 5k/3 -2 > 3k/2 <=> k > 12 => M/P > 2^12 or 4KB. % if M/P is 2^17, the problem size becomes more than 4 times. Our best previous adaptation of columnsort to an out-of-core setting, which we call \defn{threaded columnsort}, is structured into three passes, where a \defn{pass} consists of reading each record once from disk, doing some computation, and writing it back to disk. The two additional steps of subblock columnsort add an extra pass, leading to additional disk I/O, communication, and computation. In certain special cases, this pass involves no communication. \item[\textit{M}-columnsort:] As expressed in restrictions \eqnref{eq:N-orig-restr} and~\eqnref{eq:N-rev-restr}, the maximum problem size depends on the amount of memory per processor, or $M/P$, even with the relaxed height restriction achieved by subblock columnsort. Therefore, if the number of processors in the cluster increases, but the amount of memory per processor stays fixed, the maximum problem size remains unchanged. This lack of scalability is due to the height interpretation. \defn{$M$-columnsort} changes the height interpretation from $r = M/P$ to $r = M$, thus leading to the improved problem-size restriction \begin{equation} N \leq \sqrt{M^3 / 2} \ . \label{eq:N-mpcol-restr} \end{equation} On a cluster with 16 processors, with $M/P = 2^{19}$ records, this change will allow us to sort up to one terabyte of data, assuming a record size of 64 bytes. $M$-columnsort does not add any extra passes compared to the original columnsort. As we shall see in Section~\ref{sec:mpcol}, however, it does incur substantial amounts of communication and additional computation. \end{description} Experimental results on a Beowulf cluster with fast processors and a Myrinet interconnect show that the primary determinants of out-of-core execution time for a given algorithm are the amount of data per processor to be sorted and the amount of memory used on each processor. The dependence on data per processor is a consequence of the system being I/O-bound. Since subblock columnsort has four passes to threaded columnsort's three, subblock columnsort takes approximately one-third longer. $M$-columnsort, though it has exactly three passes, takes longer than threaded columnsort due to its increased computation and communication. Although $M$-columnsort takes longer than threaded columnsort, it runs faster than subblock columnsort in all cases. The remainder of this paper is organized as follows. Section~\ref{sec:background} describes the original columnsort algorithm and summarizes earlier implementations. Sections \ref{sec:rev} and~\ref{sec:mpcol} present the design of subblock columnsort and $M$-columnsort, respectively, along with notes on their implementations. Section~\ref{sec:experiment} analyses experimental results of the various columnsort programs. Finally, Section~\ref{sec:conclusion} offers some final comments and discusses future work. \section{Columnsort} \label{sec:background} In this section, we review the columnsort algorithm and earlier adaptations of it to an out-of-core setting. We conclude this section by recalling the implications of the problem-size restriction. \subheading{The basic columnsort algorithm} Columnsort sorts $N$ records arranged as an $r \times s$ matrix, where $N = rs$, $s$ divides $r$, and $r \geq 2s^2$. When columnsort completes, the matrix is sorted in column-major order. Columnsort proceeds in eight steps. Steps 1, 3, 5, and~7 are all the same: sort each column individually. Each of steps 2, 4, 6, and~8 permutes the matrix entries as follows: \begin{itemize} \item \textit{Step 2: Transpose and reshape:} We first transpose the $r \times s$ matrix into an $s \times r$ matrix. Then we ``reshape'' it back into an $r \times s$ matrix. For example, in a $6 \times 3$ matrix, the column with $r = 6$ entries $a\ b\ c\ d\ e\ f$ is transposed into a 6-entry row with entries $a\ b\ c\ d\ e\ f$ and then reshaped into the $2 \times 3$ submatrix {\small$\left[\begin{array}{ccc}a & b & c \\ d & e & f \end{array}\right]$\normalarray}.\\[-2ex] \item \textit{Step 4: Reshape and transpose:} This permutation is the inverse of that of step~2. \item \textit{Step 6: Shift down by $r/2$:} We shift each column down by $r/2$ positions, wrapping the bottom half of each column into the top half of the next column. The top half of the leftmost column is filled with $-\infty$ keys and a new rightmost column is created, with its bottom half filled with $\infty$ keys. \item \textit{Step 8: Shift up by $r/2$:} This permutation is the inverse of that of step~6. \end{itemize} \subheading{Out-of-core columnsort} In our earlier adaptations of columnsort to an out-of-core setting on a distributed-memory cluster, we assume that the cluster has $P$ processors $\Proc_0, \Proc_1, \ldots, \Proc_{P-1}$ and $D$ disks $\D_0, \D_1, \ldots, \D_{D-1}$, where $D \geq P$. A processor \defn{owns} the $D/P$ disks that it accesses.\footnote{When $D \geq P$, each processor accesses exactly $D/P$ disks over the entire course of the algorithm. When $D < P$, we require that there be $P/D$ processors per node and that they share the node's disk; in this case, each processor accesses a distinct portion of the disk. We treat this distinct portion as a separate ``virtual disk,'' allowing us to assume that $D \geq P$.} Buffers hold exactly $r$ records. The data are placed so that each column is stored in contiguous locations on the disks owned by a single processor. Specifically, processor~$j$ \defn{owns} columns $j, j+P, j+2P$, and so on. We further assume that all configuration parameters as well as the matrix dimensions $r$ and~$s$ are powers of~$2$. (Thus, $P$ must divide~$D$.) Here, we outline the basic structure of each pass; for key implementation features and performance results, see \cite{ChaudhryCoWi01, ChaudhryCo02}. Each pass in our first implementation performs two consecutive steps of columnsort. That is, pass~1 performs steps 1 and~2, pass~2 performs steps 3 and~4, pass~3 performs steps 5 and~6, and pass~4 performs steps 7 and~8. Each pass is decomposed into $s/P$ \defn{rounds}. Each round processes the next set of $P$ consecutive columns, one column per processor. A round progresses through a pipeline with the following five stages: \begin{description} \item [Read stage:] Each processor reads a column of $r$ records from the disks that it owns. \item [Sort stage:] Each processor locally sorts, in memory, the $r$ records it has just read. Implementation of this stage differs from pass to pass.\footnote{In a given pass~$p$, the data might start with some sorted runs, depending on the write pattern of pass~$p-1$. The implementation takes advantage of the sorted runs to sort by merging.} \item [Communicate stage:] Each record is destined for a specific column, depending on which even-numbered columnsort step this pass is performing. In order to get each record to the processor that owns this destination column, processors exchange records. \item [Permute stage:] Having received records from other processors, each processor rearranges them into the correct order for writing. \item [Write stage:] Each processor writes a set of $r$ records onto the disks that it owns. \end{description} Because we implemented the stages asynchronously, at any one time each stage could be working on a different round. The key features of our previous work are as follows: \begin{itemize} \item The first implementation \cite{ChaudhryCoWi01} used asynchronous I/O and asynchronous communication to overlap I/O, computation, and communication. This implementation had performance results that, by certain measures, made it competitive with the NOW-Sort program \cite{Arpaci-DusseauArCuHePa97}. \item The second implementation \cite{ChaudhryCo02} used threads in order to provide greater flexibility in overlapping I/O, computation, and communication. Experimental results showed that this improvement reduced the running time to about half of that of the first implementation. In this implementation, there were four threads per processor. The sort, communicate, and permute stages each had their own threads, and the read and write stages shared an I/O thread. \item The third implementation reduced the number of passes from four down to three by combining the last two passes into a single pass. The pipeline for the first two passes is unchanged. For the last pass, the pipeline had seven stages, two of which were sort stages and two of which were communicate stages; for details see \cite{ChaudhryCo02}. Subblock columnsort and $M$-columnsort use this implementation as the starting point. We refer to this 3-pass implementation as the \defn{threaded columnsort program}. \item All the implementations use only standard, off-the-shelf software, such as MPI \cite{SnirOtHuWaDo98} and \mbox{MPI-2} \cite{GroppHuLuLuNiSaSn98} for communication and I/O\@. \item There are no assumptions required about the keys. In fact, our algorithm's I/O and communication patterns are oblivious to the keys. \item The output appears in the standard striped ordering used by the Parallel Disk Model (PDM).\footnote{PDM ordering balances the load for any consecutive set of records across processors and disks as evenly as possible. A further advantage to producing sorted output in PDM ordering is that our algorithm can be used as a subroutine in other PDM algorithms. To the best of our knowledge, the implementations in \cite{ChaudhryCoWi01, ChaudhryCo02} are the first multiprocessor sorting algorithms whose output is in PDM order.} \end{itemize} \subheading{Recalling the problem-size restriction} We conclude this section by recalling restriction~\eqnref{eq:N-orig-restr} from Section~\ref{sec:intro}. All of the previous implementations are subject to this problem-size restriction, since they use the original columnsort algorithm, inheriting its height restriction, and they set $r$ to be $M/P$. In other words, substituting $r = M/P$ and $s = N/r = NP/M$ in the height restriction $r \geq 2s^2$ gives restriction~\eqnref{eq:N-orig-restr}. There are two main implications of this restriction. The first implication is the obvious one: the maximum problem size has an upper bound. The second implication is one of scalability. As we can see in restriction~\eqnref{eq:N-orig-restr}, the problem size $N$ depends on $M/P$, the memory per processor, rather than on the total memory $M$ of the system. \section{Subblock columnsort} \label{sec:rev} In this section, we present subblock columnsort and discuss some aspects of its implementation. \subheading{The algorithm} To describe subblock columnsort, we need to define what a \defn{subblock} is. We will be working with $\sqrt{s} \times \sqrt{s}$ subblocks of the matrix, where each subblock is a contiguous set of $\sqrt{s}$ rows and $\sqrt{s}$ columns. Subblocks are aligned to the matrix, meaning that the indices of the top row and leftmost column of each subblock must be multiples of~$\sqrt{s}$. We need $\sqrt{s}$ to be an integer, which, combined with the power-of-$2$ assumption, constrains $s$ to be a power of~$4$. The main idea behind subblock columnsort, inspired by the Revsort algorithm \cite{SchnorrSh86}, is to add two extra steps after step~3. Subblock columnsort consists of the following ten steps: \begin{itemize} \item Do steps 1--3 of columnsort. \item Step 3.1 performs any permutation that moves all the values in each $\sqrt{s} \times \sqrt{s}$ subblock into all $s$ columns. We shall refer to this property as the \defn{subblock property}. \item Step 3.2 sorts each column. \item Do steps 4--8 of columnsort. \end{itemize} In \cite{ChaudhryCo03}, it is shown that as long as $s$ divides~$r$, $s$ is a power of~$4$, and $r \geq 4s^{3/2}$, subblock columnsort sorts any input correctly. This modified height restriction implies the problem-size restriction~\eqnref{eq:N-rev-restr}. As discussed in \cite{ChaudhryCo03}, there are several permutations that have the subblock property. The one we use here, which we call the \defn{subblock permutation}, is based on permuting sets of bits within the row and column numbers. Because we assume that $r$ and~$s$ are powers of~$2$, each row number is a sequence of $\lg r$ bits and each column number is a sequence of $\lg s$ bits. We shall express the subblock permutation in terms of the source row and column numbers and the corresponding target row and column numbers of each matrix element. \begin{figure} \begin{center} \psfrag{Zi}[Bl][Bl][0.9]{$i$} \psfrag{Zj}[Bl][Bl][0.9]{$j$} \psfrag{Zip}[Bl][Bl][0.9]{$i'$} \psfrag{Zjp}[Bl][Bl][0.9]{$j'$} \psfrag{Zw}[Bc][Bc][0.9]{$w$} \psfrag{Zx}[Bc][Bc][0.9]{$x$} \psfrag{Zy}[Bc][Bc][0.9]{$y$} \psfrag{Zz}[Bc][Bc][0.9]{$z$} \psfrag{Zlgrrsbits}[Bc][Bc][0.9]{$\lg r/\sqrt{s}$ bits} \psfrag{Zlgrsbits}[Bc][Bc][0.9]{$\lg \sqrt{s}$ bits} \psfrag{Z0}[tr][tr][0.7]{$0$} \psfrag{Zlgrsm1}[tl][tl][0.7]{$\lg \sqrt{s}-1$} \psfrag{Zlgrs}[tr][tr][0.7]{$\lg \sqrt{s}$} \psfrag{Zlgrm1}[tl][tl][0.7]{$\lg r-1$} \psfrag{Zlgsm1}[tl][tl][0.7]{$\lg s-1$} \psfrag{Zlgrrsm1}[tl][tl][0.7]{$\lg r/\sqrt{s}-1$} \psfrag{Zlgrrs}[tr][tr][0.7]{$\lg r/\sqrt{s}$} \includegraphics{bit-mapping.eps} \end{center} \vspace*{-2ex} \figcaption{The subblock permutation as a bit permutation. Each entry in the $r \times s$ matrix has a row number, expressed in $\lg r$ bits, and a column number, expressed in $\lg s$ bits. The subblock permutation permutes the element in row $i$ and column $j$ to row $i'$ and column $j'$. We number the bits starting from $0$ as the least significant bit, and we use the notation ``$\twodots$'' to denote ranges of consecutive bits. The permutation maps bits $w = \bits{i}{\lg \sqrt{s}}{\lg r-1}$ to $\bits{i'}{0}{\lg r/\sqrt{s}-1}$, $x = \bits{i}{0}{\lg \sqrt{s}-1}$ to $\bits{j'}{\lg \sqrt{s}}{\lg s-1}$, $y = \bits{j}{\lg \sqrt{s}}{\lg s-1}$ to $\bits{i'}{\lg r/\sqrt{s}}{\lg r-1}$, and $z = \bits{j}{0}{\lg \sqrt{s}-1}$ to $\bits{j'}{0}{\lg \sqrt{s}-1}$. Because $x$ determines the source row number within an element's subblock and $z$ determines the source column number within the subblock, this mapping ensures that the bits forming an element's target column number come from the bits that determine where in a source subblock the element started.} \label{fig:bit-mapping} \end{figure} To ensure that the subblock property holds, we only need to show that two distinct elements in the same source subblock will map to two different column numbers. We do so by ensuring that the $\lg s$ bits that determine the target column number come from source bits that determine where in a $\sqrt{s} \times \sqrt{s}$ subblock a matrix element resides. Figure~\ref{fig:bit-mapping} shows the idea. If we look at the source row and column numbers of a given element, the least significant $\lg \sqrt{s}$ bits of each---denoted by $x$ and~$z$, respectively, in the figure---determine the row and column numbers of that element within its subblock. (The most significant bits---$w$ and~$y$---determine which subblock the element is in, but not where in the subblock it resides.) The subsequences $x$ and~$z$ form the bits of the target column number, with $z$ forming the least significant half and $x$ forming the most significant half. Thus, the subblock permutation has the subblock property. As an arithmetic formula, the subblock permutation maps the $(i,j)$ entry to position $(i',j')$, where \begin{eqnarray*} i' & = & \floor{\frac{j}{\sqrt{s}}}\frac{r}{\sqrt{s}} + \floor{\frac{i}{\sqrt{s}}} \ , \\ j'& = & j \bmod \sqrt{s} + (i \bmod \sqrt{s})\sqrt{s} \ . \end{eqnarray*} It may seem strange that the target row number is formed by using $w$ as the least significant bits, when $w$ started out as the most significant bits of the source row number. The advantage of permuting in this way is that it creates sorted runs of $r/\sqrt{s}$ elements in each column. To see why, first observe that entering the subblock permutation, each column is sorted. Now consider two elements $e_1$ and~$e_2$ that start in the same column (so that their $y$ and~$z$ bits are the same) and are permuted into the same target column (so that their $x$ bits are also the same). There are $r/\sqrt{s}$ elements in a source column that have fixed values of the $x$, $y$, and~$z$ bits, and they vary in their $w$ bits. Let the $w$ bits of $e_1$ and~$e_2$ be $w_1$ and~$w_2$, respectively, and assume that $w_2 = w_1 + 1$ (so that $e_2$ is $\sqrt{s}$ rows below $e_1$ in the source column and $e_1 \leq e_2$). Because $e_1$ and~$e_2$ have the same $y$ bits, which become the most significant bits of the target row, and because the $w$ bits become the least significant bits of the target row, $e_2$'s target row is $1$ greater than $e_1$'s target row. Since $e_1$ and~$e_2$ are any two elements that start out $\sqrt{s}$ rows apart in their source column, we see that all $r/\sqrt{s}$ elements that start in the same source column and are permuted to the same target column appear as one sorted run of size~$r/\sqrt{s}$ in the target column. \subheading{Implementation notes} Since subblock columnsort differs from columnsort only in the two additional steps, our implementation of subblock columnsort started with the 3-pass threaded columnsort program and integrated these two steps as one extra pass, which we call the \defn{subblock pass}. The subblock pass performs steps 3 and~3.1 of subblock columnsort. The overall thread structure of subblock columnsort is the same as that of the threaded columnsort program. % We have one thread for disk I/O, one thread for sorting, one thread % for interprocessor communication and finally one thread for the % permuting that is done before writing. Subblock columnsort has 4 % passes, pass~1 performs steps 1 and~2, pass~2 performs steps 3 % and~3.1, pass~3 (or the subblock pass) performs steps 3.2 and~4, and % finally pass~4 performs steps~5--8. Passes 1, 3, and~4 of subblock % columnsort are the same as passes 1, 2, and~3 of the threaded % columnsort program, respectively.\footnote{The sort phase of pass~3 of % subblock columnsort differs slightly from the sort phase of pass~2 of % the threaded columnsort program. This is so because the size of the % sorted runs produced by step~3.1 of subblock columnsort is larger than % the size of the sorted runs produced by step~2 of columnsort by a % factor of~$\sqrt{s}$.} Like the first two passes in threaded columnsort, the subblock pass is decomposed into $s/P$ rounds, where each round processes the next set of $P$ consecutive columns, one column per processor, in a five-stage pipeline. The only stage of the subblock pass that differs substantially from the corresponding stage of passes 1 and~2 of threaded columnsort is the communicate stage. In each round's communicate stage within passes 1 and~2 of threaded columnsort, each processor sends $P$ messages. Each message consists of $r/P$ records. One of these messages goes back to the sending processor, in which case the message does not need to go over the network. For the subblock pass, however, we shall show three properties: \begin{enumerate} \item In the communicate stage of each round, each processor sends only $\zceil{P/\sqrt{s}}$ messages, each of size $r/\zceil{P/\sqrt{s}}$. \item When $\sqrt{s} \geq P$ so that $\zceil{P/\sqrt{s}} = 1$, the one message is always destined for the sending processor, and therefore no communication over the network occurs. \item Any permutation that achieves the subblock property must send at least $\zceil{P/\sqrt{s}}$ messages per round, and so the communication pattern of our subblock pass is optimal. \end{enumerate} To see that the first two properties hold, we again refer to Figure~\ref{fig:bit-mapping}. Let us examine the processors to which column~$j$'s elements are mapped by the subblock permutation. Column~$j$ is owned by processor~$p$, where $p = j \bmod P$. This processor number is the $\lg P$ least significant bits of the column number~$j$. If $P \leq \sqrt{s}$, then the bits that determine the processor number are entirely within the $z$ field in Figure~\ref{fig:bit-mapping}. Since these bits are the same in the target column as they are in the source column, the target processor number must be the same as the source processor number. Thus, since $\zceil{P/\sqrt{s}} = 1$, we have proven property~2. If $P > \sqrt{s}$, then the $\lg \sqrt{s}$ least significant bits of the source and target column numbers for a given element are the same. Therefore, only $\lg P - \lg \sqrt{s}$ of the bits determining the processor number can differ. These bits come from the $x$ field, which is part of the row number. All combinations of these bits will occur in a given source column, and so all combinations will occur in the target processor number. There are $2^{\lg P - \lg \sqrt{s}} = P/\sqrt{s}$ such combinations. By the power-of-$2$ assumption, therefore, we have \mbox{$\zceil{P/\sqrt{s}} = P/\sqrt{s}$}, and so we have proven property~1. To show property~3, we shall show that if the subblock property holds and any source column maps to fewer than $P/\sqrt{s}$ target processors, then a contradiction arises. We start by noting that every processor must own exactly $s/P$ columns. Now let us suppose that some source column, say column~$j$, maps to~$k$ target processors, where $k < P/\sqrt{s}$. Therefore, column~$j$ maps to fewer than $k(s/P)$ target columns. Since $k < P/\sqrt{s}$, we have that $k(s/P) < \sqrt{s}$, so that column~$j$ maps to fewer than $\sqrt{s}$ target columns. Because the subblock property holds, any $\sqrt{s}$ entries within a given subblock must map to $\sqrt{s}$ different columns. If we consider the intersection of any subblock with column~$j$, we have $\sqrt{s}$ entries of the subblock, and hence this portion of column~$j$ (not even considering the rest of the column) must map to $\sqrt{s}$ different target columns. This fact contradicts our earlier conclusion that column~$j$ maps to fewer than $\sqrt{s}$ target columns. Thus, we have proven property~3. \section{\textbf{\textit{M}}-columnsort} \label{sec:mpcol} In order to achieve the problem-size restriction~\eqnref{eq:N-mpcol-restr}, $N \leq \sqrt{M^3/2}$, we consider each column to be $M$ elements, so that $r = M$. Recall that with this more relaxed restriction, the maximum problem size now scales with the memory in the entire system, so that adding more processors with the same amount of memory per processor increases the maximum problem size. In fact, this increase is superlinear in the total memory size. \subheading{Implementation notes} As with subblock columnsort, our implementation of $M$-columnsort is a modification of 3-pass threaded columnsort. Instead of adding a pass, however, we increase the complexity of the sort stage of each pass. When $r = M/P$, the sort stage is just a local sort on each processor. In $M$-columnsort, since $r = M$, the sort stage becomes a multiprocessor sort with distributed memory. One benefit of performing a multiprocessor sort is that we can eliminate the communicate stage in the first two passes. We use in-core columnsort for the distributed-memory multiprocessor sort. We implemented three in-core multiprocessor sorting algorithms: bitonic sort, radix sort, and columnsort. We found that in-core columnsort, with an $(M/P) \times P$ matrix, was consistently faster than bitonic sort on problem sizes representative of those we encounter in the sort stage. Radix sort was competitive with in-core columnsort over a wide range of problem sizes, but we decided to use in-core columnsort because radix sort has a high dependence on the key format and because columnsort's communication patterns are independent of the values in the keys. Our implementation of in-core columnsort is multithreaded. In particular, there are two threads: one for local sorting (steps 1, 3, 5, and~7 of the in-core sort) and one for communication (steps 2, 4, 6, and~8 of the in-core sort). These threads are in addition to the four non-sort threads inherited from threaded columnsort. Wherever threaded columnsort's pipeline had a single stage for sorting, $M$-columnsort's pipeline has eight stages. Each stage in a pipeline sends a buffer to its successor, and the additional threads in $M$-columnsort require the allocation of four additional buffers. The pipeline for each of the first two passes has 11 stages: read, the eight from in-core columnsort, permute, and write. These 11 stages occupy only four threads: one for both read and write, one for permute, one for the four local sorting steps of in-core columnsort, and one for the four communication steps of in-core columnsort. We are able to eliminate the communicate stage from the out-of-core pipeline because each column is shared among all the processors. After the in-core sort, each processor has its own portion of the sorted data. Depending on the permutation stage of the out-of-core sort, this data is destined for a certain set of target columns. Since each processor owns a portion of each column, we were able to design the in-core sort to ensure that each processor can write out its data into its portion of each of the target columns. The pipeline for the last pass is rather different. We eliminate one communicate stage, but each of the two sort stages turns into eight in-core sort stages. Although the resulting pipeline has 20 stages, they occupy only seven threads: one for both read and write, one for permute, one for the remaining communicate stage, and two for each of the two in-core sorts. \section{Experimental results} \label{sec:experiment} The experimental results came from runs on a Beowulf cluster that has 16 dual 1.5-GHz P4 Xeon nodes, each with 1~GB of RAM\@. The nodes run Redhat Linux~7.2 and are connected with a high-speed Myrinet network which has a peak speed of 250~MB per second. Each node has an Ultra-160 10,000-RPM SCSI-3 hard drive, with about 10~GB of available disk space for user files. For disk I/O, we use the C \texttt{stdio} interface. The nodes communicate using standard synchronous MPI \cite{SnirOtHuWaDo98} calls within asynchronous threads. Our threads are implemented using the pthreads package of Linux. Not all MPI implementations work correctly in the presence of multiple threads. We found that MPI/Pro, which allows multiple threads to issue MPI calls, worked well. Although each node has two processors, we found no advantage to running more than one process per node. Thus, we treat each node as if it had only one processor, and we use the terms ``node'' and ``processor'' interchangeably. Our experimental runs were for combinations of the following: \begin{description} \item[Algorithm:] We ran threaded columnsort, subblock columnsort, and $M$-columnsort. For a baseline, we also ran just the I/O portions of three and four passes of columnsort. In this way, we can determine how much time of each run was spent waiting for non-I/O activity. \sloppy \item[Buffer size:] For threaded columnsort, subblock columnsort, and $M$-columnsort, we varied the buffer size~($r$). We report here results for buffer sizes of $2^{24}$ and $2^{25}$ bytes. Note that these buffer sizes, being in bytes, are not in terms of number of records, and so they should not be construed as equaling $M/P$. Record sizes varied between 64 and 128 bytes, but we found that the buffer size mattered more than the record size. \fussy \enlargethispage{-6ex} \item[Number of processors and volume of data:] We ran various combinations with 4, 8, and~16 processors and with an amount of data varying from 4~GB up to 32~GB\@. We did not run any experiments with less than 1~GB of data per processor because file-caching effects masked the out-of-core nature of the problem. We were unable to perform any runs with more than 2~GB of data per processor due to disk-space limitations.\footnote{We did not overwrite the original data files so that we could verify the correctness of the output files. Moreover, our implementation requires a temporary file. Therefore, the input, output, and temporary files together induce a disk-space requirement of three times that of the input file size. The cluster on which we ran our experiments did not permit us to use much more than 6~GB of disk space per node.} All runs, therefore, were for either 1~GB or 2~GB of data per processor. \end{description} For a given algorithm and buffer size, we found that the amount of data per processor was by far the most important factor in determining run time. Given the large amount of disk I/O that each of the algorithms has to perform, this characteristic did not come as a surprise. We couch our results in terms of GB of data per processor. \begin{figure} \begin{center} \psfrag{ZSubblock24}[Bl][Bl][0.9]{Subblock columnsort, buffer size = $2^{24}$} \psfrag{ZSubblock25}[Bl][Bl][0.9]{Subblock columnsort, buffer size = $2^{25}$} \psfrag{ZM24}[Bl][Bl][0.9]{$M$-columnsort, buffer size = $2^{24}$} \psfrag{ZM25}[Bl][Bl][0.9]{$M$-columnsort, buffer size = $2^{25}$} \psfrag{ZThreaded24}[Bl][Bl][0.9]{Threaded columnsort, buffer size = $2^{24}$} \psfrag{ZThreaded25}[Bl][Bl][0.9]{Threaded columnsort, buffer size = $2^{25}$} \psfrag{ZIO4}[Bl][Bl][0.9]{Baseline I/O time, 4 passes} \psfrag{ZIO3}[Bl][Bl][0.9]{Baseline I/O time, 3 passes} \resizebox{!}{4in}{\includegraphics{spaa-timings.eps}} \end{center} \vspace*{-2ex} \figcaption{Execution times, in seconds, for the three versions of columnsort plus baseline I/O times for three and four passes.} \label{fig:experiment} \end{figure} Figure~\ref{fig:experiment} summarizes our experimental results. Each plotted point in the figure represents the average of multiple runs of an algorithm with a given buffer size, but with the number of processors and the record size varying. Variations in running times were relatively small (within 10\%). The horizontal axis is organized by total number of GB of data sorted across all processors. Due to the problem-size restriction of equation~\eqnref{eq:N-orig-restr}, threaded columnsort could not handle more than 4~GB of data. Results for threaded columnsort appear as single points in Figure~\ref{fig:experiment}. For a buffer size of $2^{25}$ bytes, the total time is just barely above the baseline 3-pass I/O time; in other words, threaded columnsort is almost purely I/O-bound. When we halve the buffer size, to $2^{24}$ bytes, there are more frequent switches between pipeline stages and so the I/O-boundedness diminishes somewhat. We found that with only one exception, larger buffer sizes resulted in faster execution. We could not make these buffers arbitrarily large because there is a limit on how large these buffers can get until demand paging starts slowing down the execution. Due to the power-of-$4$ restriction on $s$ in subblock columnsort, not all problem sizes were eligible to be run for a given value of~$r$ (i.e., for a given buffer size). That is why the two lines plotted for subblock columnsort in Figure~\ref{fig:experiment} cover disjoint problem sizes, and why each line covers problem sizes that differ by a factor of~$4$. With a buffer size of $2^{25}$ bytes, the running times are only slightly above the baseline 4-pass I/O time. (Recall that subblock columnsort has one pass more than threaded columnsort.) Thus, subblock columnsort is substantially I/O-bound. With the smaller buffer size of $2^{24}$ bytes, execution times increase for the same reason as in threaded columnsort. Observe that each of the subblock columnsort lines rises only slightly as the volume of data sorted increases, thus demonstrating our earlier claim that the volume of data per processor is the most salient characteristic in determining execution time. Figure~\ref{fig:experiment} shows one particular advantage of $M$-columnsort over the other two algorithms: we were able to run it at all four problem sizes. In fact, had sufficient disk space been available, we could have run $M$-columnsort on up to one terabyte total on 16 processors with $2^{25}$-byte buffers and 64-byte records. Execution times are well above the baseline 3-pass I/O time, and so $M$-columnsort is not nearly as I/O-bound as the other two algorithms. This fact is of course no surprise, since $M$-columnsort has a much more complicated in-core sort stage and also uses more memory (due to the extra buffers required by the additional threads). The one exception to the rule of larger buffer sizes being better was that for 4~GB of data, we found that $M$-columnsort ran faster with a buffer size of $2^{24}$ bytes than with $2^{25}$ bytes. We believe that this particular run for the $2^{25}$-byte buffer size was an anomaly, as it occurred during a time in which the disk space available on $M$-columnsort's nodes was barely more than $M$-columnsort required and the remaining nodes of the cluster were running a different program that performed heavy amounts of communication. In all our runs, $M$-columnsort was clearly superior to subblock columnsort. According to the performance results in Figure~\ref{fig:experiment}, $M$-columnsort was always at least as fast as subblock columnsort. The primary reason that $M$-columnsort was faster is that it makes only three passes over the data rather than four passes. For a given buffer size, subblock columnsort can handle only problem sizes that are powers of~$4$, whereas $M$-columnsort can handle any power-of-$2$ problem size. Furthermore, for most realistic configuration sizes, $M$-columnsort can sort larger problem sizes than subblock columnsort. By restrictions \eqnref{eq:N-rev-restr} and~\eqnref{eq:N-mpcol-restr}, $M$-columnsort can handle a larger number of records than subblock columnsort if $M^{3/2} / \sqrt{2} > (M/P)^{5/3} / 4^{2/3}$ or, equivalently, if $M < 32 P^{10}$. For example, if $P=8$ processors, then as long as the total memory in the system holds fewer than $2^{35}$ records, $M$-columnsort will sort more records than subblock columnsort. \section{Conclusion} \label{sec:conclusion} We have seen two ways to increase the problem-size bound in out-of-core columnsort. Subblock columnsort modifies the original columnsort algorithm by adding extra steps and altering the exponent in the height restriction. In the out-of-core implementation, subblock columnsort requires one additional pass. Like threaded columnsort, subblock columnsort's maximum problem size increases only with the amount of memory per processor (though the increase is superlinear). $M$-columnsort leaves the original columnsort algorithm intact, but it uses a different height interpretation in the out-of-core setting. By setting the column height to the amount of memory across the entire system, $M$-columnsort's maximum problem size increases superlinearly in the amount of memory in the system. Experiments on a Beowulf cluster show that both subblock columnsort and $M$-columnsort run well but that $M$-columnsort is superior. Moreover, $M$-columnsort handles a wider range of problem sizes than subblock columnsort. There are several directions for our future work: \begin{itemize} \item We plan to combine subblock columnsort and $M$-columnsort into one four-pass algorithm which has a problem-size bound of $N \leq M^{5/3} / 4^{2/3}$, i.e., restriction~\eqnref{eq:N-rev-restr} but with $M/P$ replaced by~$M$. This algorithm would have a larger problem-size bound than either subblock columnsort or $M$-columnsort alone. \item The closer the height interpretation is to $r = M/P$, the less communication overhead is incurred during the sort stages. We will develop an implementation that allows for values of~$r$ between $M/P$ and~$M$, depending on the problem size~$N$ for a given run. \item Our current implementations are I/O-bound, and their I/O bandwidth is below the sustainable rate of the disks. There is only so much that we can do about I/O rates while keeping the I/O interface at a reasonably high level. We do expect, however, to investigate memory-mapped I/O to eliminate unnecessary copying of data. \item We would like to eliminate the power-of-$2$ requirement from as many parameters as possible. \item There may be other algorithms for the in-core sort stages of $M$-columnsort that are superior to in-core columnsort. In particular, we will try a distribution-based sorting method. \end{itemize} \subheading{Acknowledgments} The Beowulf cluster upon which we performed experiments is located at the Institute for Security Technology Studies (ISTS) at Dartmouth College. We are grateful to Garry Davis, Michel Gray, and David Nicol for granting us access to the cluster and to the researchers at ISTS for accommodating our schedule. MPI/Pro is a commercial product from MPI Software Technology, Inc. We thank Tony Skjellum, David Leimbach, Rossen Dimitrov, Weiyi Chen, and Tracey Miller of MPI Software Technology, Inc., for assistance with installing and using MPI/Pro. \bibliography{/u/thc/IO/IO,/u/thc/mytex/papers} \end{document}