Suffix arrays

A suffix array is a simple data structure that enables lookup of any substring of a text and identification of repeated substrings. It is more compact than a suffix tree and is amenable to storage in secondary memory.

A suffix array is (notionally) a sorted list of the suffixes of a given string. (A suffix is a substring that extends to the end of the given string.) The sorted list is presented as an array of integers that identify the suffixes in order. The suffix array for the 11-letter text ABRACADABRA, with an endmark # that collates low, is given below, together with the corresponding suffixes.

11  #
10  A#
 7  ABRA#
 0  ABRACADABRA#
 3  ACADABRA#
 5  ADABRA#
 8  BRA#
 1  BRACADABRA#
 4  CADABRA#
 6  DABRA#
 9  RA#
 2  RACADABRA#
Together, the suffix array and original string enable binary search for any substring, e.g. CAD.

A useful auxiliary data structure is an `LCP array', an array of lengths of the longest common prefix between each substring and its predecessor in the suffix array. The second column below gives the LCP array for the previous example. The third column shows that a suffix array may also be interpreted as an ordered list of circular shifts.

11  0  #ABRACADABRA
10  0  A#ABRACADABR
 7  1  ABRA#ABRACAD
 0  4  ABRACADABRA#
 3  1  ACADABRA#ABR
 5  1  ADABRA#ABRAC
 8  0  BRA#ABRACADA
 1  3  BRACADABRA#A
 4  0  CADABRA#ABRA
 6  0  DABRA#ABRACA
 9  0  RA#ABRACADAB
 2  2  RACADABRA#AB
The last column of letters, ARD#RCAAAABB, is Wheeler's transform, which is central to B-W data compression. An introduction to its use may be found in the Wikipedia article, "Burrows-Wheeler Transform".

Suffix-array functions

The archive sarray.tar contains C functions that compute suffix arrays efficiently by methods that build on work of Manber and Myers. The functions handle strings of integers, which may contain codes for characters, words, or other types of data. Contents:
sarray.h
Header file.
ssarray.c
A simplification of the Myers/Manber technique, by Peter McIlroy and Douglas McIlroy. This is the program to study, but sarray.c is the one for heavy lifting.
sarray.c
An extensively engineered hybrid of ssarray.c and quicksort, usually more than five times as fast as ssarray.c, by Sean Quinlan and Sean Dorward. Also contains a specialized interface, bsarray, tailored to byte-string data. A similar program appeared in N. Jesper Larsson's PhD thesis.
lcp.c
Computes an LCP array for a given suffix array, by a linear-time method due to Kasai et al.
scode.c
Encodes a string into a canonical form for input to ssarray or sarray.
SuffixArray.java and SuffixArray.c
Java interface to the C functions.
sarray.3
Unix-style man page, troff source, also available in gzipped PostScript, and PDF.
Updated April 6, 2010