- Direct Known Subclasses:
SuffixArray.Array
,SuffixArray.Byte
,SuffixArray.Char
,SuffixArray.Integer
,SuffixArray.Short
,SuffixArray.String
,SuffixArray.UnsignedByte
,SuffixArray.UnsignedShort
A suffix array is an array of indices that indicate the index for the first element of each suffix, sorted into ascending lexical order. the first element represents an empty sequence, and its value is the length of the array (the element it points to is assumed to be the sentinel character).
Suffix arrays have numerous applications including string searching (e.g., exact matches and common substrings), and computational biology. Because of the importance of this data structure, a considerable amount of work has gone into finding efficient algorithms for generating suffix arrays. The following provides a list of citations for some of this research:
- Ge Nong, Sen Zhang, and Wai Hong Chan, "Linear suffix array construction by almost pure induced-sorting", Proceeding of the 2009 Data Compression Conference, Pages 193-202, IEEE Computer Society (ISBN: 978-0-7695-3592-0 doi>10.1109/DCC.2009.42) contains a description of an efficient algorithm for creating suffix arrays.
- Ge Nong, Sen Zhang, and Wai Hong Chan, "Two efficient algorithms for linear time suffix array construction"
- A walk through the SA-IS Suffix Array Construction Algorithm provides a Python implementation of the algorithm and is a useful starting point for an implementation.
- Toru Kasai, Gunho Lee , Hiroki Arimur, Setsuo Arikawa, and Kunsoo Park, "Linear-time longest-common-prefix computation in suffix arrays and its applications" provides an algorithm for computing an LCP array (Longest Common Prefix).
- LCP Array Construction and LCP from suffix array has some coding examples.
The SuffixArray class is an abstract class, with several subclasses defined as nested classes:
- SuffixArray.Integer. The sequence is an array of integers.
- SuffixArray.Char. The sequence is an array of characters.
- SuffixArray.Short. The sequence is an array of short integers.
- SuffixArray.UnsignedShort. The sequence is an array of short integers, treated as unsigned numbers(for a short h, the corresponding unsigned value is 0xFFFF & h).
- SuffixArray.Byte. The sequence is an array of bytes.
- SuffixArray.UnsignedByte. The sequence is an array of bytes, treated as unsigned numbers (for a byte b, the corresponding unsigned value is 0xFF & b).
- SuffixArray.String. The sequence is a string, If an alphabet is provided, its type must be Set<Character>.
- SuffixArray.Array. The sequence is an array of objects of type T, where T is a type parameter. An alphabet is required and its type must be Set<T>.
Set
, and will be mapped to a sequence of
integers, starting at 0, and in the order implied by the set's iterator.
To force the use of a particular order, use LinkedHashSet
or TreeSet
.
Regardless of how the alphabet is defined, the algorithm used to
construct the suffix array allocates and uses arrays whose sizes
are equal to the size of the alphabet, so keeping the alphabet to
the minimum size needed is advantageous.
In this package, each subclass of SuffixArray has methods that return the sequence associated with the suffix array. These methods are not provided by the SuffixArray class itself because the type of the value returned differs between subclasses.
The methods getArray()
and getLCP()
can be used
to get arrays that can be used to simulate the traversal of
a suffix tree. The method getInverse()
allows one to map
a suffix (represented as an index into the sequence array or string)
into the corresponding index in the suffix array.
Subclasses provide additional methods in cases where a method's signature
is dependent on types (especially primitive types) associated with that
subclass. These classes to not have corresponding abstract methods in
SuffixArray
. For the current package, several such methods are
provided by all the subclasses of SuffixArray
:
- getSequence. This method will return the sequence used to create a suffix array.
- getBWT. This method will compute the Burrows-Wheeler transform for this suffix array. Aside from its use in data compression, some search algorithms use this transform.
- findInstance. This method (there are two variants) will find an offset into the sequence array where the offset matches the specified subsequence. This is a convenience method that uses findSubsequence to look up an index in the suffix array, but returns the corresponding index into the sequence instead.
- findRange. This method (there are two variants) will find a a range of offsets into the sequence array where each offset matches the specified subsequence.
- findSubsequence. This method (there are three variants) allows instances of a subsequence to be found efficiently. For the case without a boolean argument, where an arbitrary instance of the subsequence is returned, the time complexity depends on whether or not LCP arrays are used: if n is the length of the sequence and m is the length of a subsequence, the time complexity is O(m log n) if the LCP-LR auxiliary data structure is not used and is O(m + log n) if the LCP-LR auxiliary data structure is used. One variant allows one to search just a portion of a suffix array. For example, when searching for a number of different HTML tags, one can first search for '<' to find the indicies in the suffix array for all suffixes that start with '<' and then search that subrange for the remainder of the tag, which can improve performance as the subsequent binary searches will be faster.
- inverseBWT. This method (which is a static method) computes the inverse Burrows-Wheeler transform.
Auxiliary tables
The implementation supports three auxiliary tables:
- An inverse table. This table is implemented as an array
with the same length as the suffix array.
A copy of this table can be created by
calling
getInverse()
. To create this table and cache it, calluseInverse()
. To remove the cached table, callclearCachedInverse()
. To check if a cached table exists, callhasInverse()
. - an LCP (Longest Common Prefix) table. This table is implemented
as an array with the same length as the suffix array. A copy
of this table can be created by calling
getLCP()
. To create this table and cache it, calluseLCP()
. To remove the cached table, callclearCachedLCP()
. To check if a cached LCP table exists, callhasLCP()
. Creating the LCP table requires the inverse table, so if the index table will be used after the LCP table is created, calluseInverse()
before callinguseLCP()
orgetLCP()
. - an LCP-LR table. This contains a left and right LCP table for
a binary search case and is used to speed up some computations.
The table consists of two protected fields, both arrays and
each as large as the corresponding suffix array. To create
this table, call
useLCPLR()
. To remove the table, callclearCachedLCPLR()
. To check if the table exists, callhasLCPLR()
.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic final class
Class providing a suffix array for Object sequences.static final class
Class providing a suffix array for byte-valued sequences.static final class
Class providing a suffix array for char-valued sequencesstatic final class
Class providing a suffix array for int-valued sequencesstatic class
Iterator for a suffix array.static interface
Interface representing a range in a suffix array resulting from a search for all instances of a subsequence.static final class
Class providing a suffix array for short-valued sequencesstatic final class
Class providing a suffix array for String sequencesstatic class
Class providing a suffix array for unsigned-byte-valued sequences Java bytes are signed.static final class
Class providing a suffix array for short-valued sequencesstatic final class
Class providing a suffix array for byte arrays containing UTF-8 encoded characters. -
Field Summary
FieldsModifier and TypeFieldDescriptionprotected int[]
The suffix array.protected int[]
The LCP_L table.protected int[]
The LCP_R table.protected int[]
The Longest Common Prefix (LCP) length array.protected int[]
The inverse of the suffix array.protected int
One higher than the length of the sequence array. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionvoid
Clear the cached inverse mapping.void
Clear the cached least-common-prefix array.void
Remove the cached LCP-LR table, if any, associated with this suffix array.protected abstract int
commonPrefixLength
(int index1, int index2) Given two indices into the sequence array, compute the length of a common prefix.long
Count the number of unique subsequences.protected abstract void
fillLCPArray
(int[] ourlcpArray, int[] rank) Fill the LCP array with the correct values.int[]
getArray()
Get the suffix array.int[]
Get the inverse mapping for this suffix array.int[]
getLCP()
Get the LCP array corresponding to this suffix array.boolean
Test if the inverse array is currently cached.boolean
hasLCP()
Test if the LCP array is currently cached.boolean
hasLCPLR()
Determine if this suffix array has an LCP-LR table associated with it.int
lcpLength
(int index1, int index2) Get the length of the longest common prefix (LCP) for two suffixes.Get an iterator that will iterator that will provide a sequence of indices and lengths for each subsequence of the sequence corresponding to this suffix array.void
Configure this suffix array so that it has an inverse mapping.void
useLCP()
Configure this suffix array to use an LCP (Longest Common Prefix) table.void
useLCPLR()
Configure a suffix array to use an LCP-LR table.
-
Field Details
-
slenp1
protected int slenp1One higher than the length of the sequence array. Subclasses must set this field. -
array
protected int[] arrayThe suffix array. Subclasses must set this field. -
lcpArray
protected int[] lcpArrayThe Longest Common Prefix (LCP) length array. Subclasses may read this field but should not set it or alter it. -
rank
protected int[] rankThe inverse of the suffix array. Subclasses may read this. -
LCP_L
protected int[] LCP_LThe LCP_L table.- See Also:
-
LCP_R
protected int[] LCP_RThe LCP_R table.- See Also:
-
-
Constructor Details
-
SuffixArray
public SuffixArray()
-
-
Method Details
-
hasInverse
public boolean hasInverse()Test if the inverse array is currently cached.- Returns:
- true if the inverse array is cached; false if it has to be recomputed
-
useInverse
public void useInverse()Configure this suffix array so that it has an inverse mapping. This inverse mapping maps the offset into a sequence to the corresponding offset into the suffix array. Calling this method will result in the inverse table being cached. To clear the cached inverse table, callclearCachedInverse()
-
getInverse
public int[] getInverse()Get the inverse mapping for this suffix array. The inverse mapping is an array whose index is the index in the sequence array and whose value is the corresponding index in the suffix array.- Returns:
- the inverse mapping
-
clearCachedInverse
public void clearCachedInverse()Clear the cached inverse mapping. This can be called to reduce memory usage if all external references to the mapping have been removed. The mapping can be restored if needed, but the time complexity is O(n) where n is the length of the sequence. -
clearCachedLCP
public void clearCachedLCP()Clear the cached least-common-prefix array. This can be called to reduce memory usage if all external references to the LCP length array have been removed. The array can be restored if needed, but the time complexity is O(n) where n is the length of the sequence. -
fillLCPArray
protected abstract void fillLCPArray(int[] ourlcpArray, int[] rank) Fill the LCP array with the correct values. (the value at index 0 will be set outside of this method).A typical implementation would use the following code:
Subclasses will have different types for the variableprotected void fillLCPArray(int[] ourlcpArray, int[] rank) { int k = 0; int n = sequence.length; for (int i = 0; i < n; i++) { if (rank[i] == n) { k = 0; continue; } int j = array[rank[i]+1]; while (i+k < n && j+k < n && sequence[i+k] == sequence[j+k]) { k++; } ourlcpArray[rank[i]+1] = k; if (k > 0) k--; } }
sequence
.- Parameters:
ourlcpArray
- the LCP array to fillrank
- the inverse mapping for this suffix array
-
hasLCP
public boolean hasLCP()Test if the LCP array is currently cached.- Returns:
- true if the LCP array is cached; false if it has to be recomputed
-
useLCP
public void useLCP()Configure this suffix array to use an LCP (Longest Common Prefix) table. Calling this method will cause the LCP table to be cached. To remove the cached table, callclearCachedLCP()
. -
getLCP
public int[] getLCP()Get the LCP array corresponding to this suffix array. An LCP (Longest Common Prefix) array provides the length of the longest common prefix for the subsequence represented by an index i in a suffix array and the subsequence at index (i-1). A value of -1 indicates that no value exists (this will be true at index 0, as there is no subsequence at index -1).NOTE: the algorithm used to create the LCP table requires the inverse array (returned by
getInverse()
. If the inverse mapping will be used later, one should calluseInverse()
before this method is called to avoid creating the inverse mapping multiple times. -
hasLCPLR
public boolean hasLCPLR()Determine if this suffix array has an LCP-LR table associated with it.- Returns:
- true if there is an LCP-LR table; false otherwise
-
clearCachedLCPLR
public void clearCachedLCPLR()Remove the cached LCP-LR table, if any, associated with this suffix array. This can be called to reduce memory usage. The table can be restored if needed, but the time complexity is O(n) where n is the length of the sequence. -
useLCPLR
public void useLCPLR()Configure a suffix array to use an LCP-LR table. This table contains two entries per offset into the suffix array. The 'left' entry is the minimum LCP value for a range of values lower than the offset, and the 'right' entry is the minimum LCP value for a range of values higher than the offset. The length of these ranges are such that the offset will be the mid point between a 'low' and 'high' value during a binary search.This table is implemented as two integer arrays, each the same size as an LCP array. For long sequences, a significant amount of memory will be required.
Calling this method will result in the LCP-LR table being cached. To clear the cached table, call
clearCachedLCPLR()
. In addition, this method uses an LCP array. The methoduseLCP()
should be called before a call to this method if the LCP array will be used later: otherwise the LCP array will be recomputed. -
commonPrefixLength
protected abstract int commonPrefixLength(int index1, int index2) Given two indices into the sequence array, compute the length of a common prefix.This method is abstract because the implementation is dependent on the type of sequence-array elements. It's implementation will typically compare the prefixes element by element until an index is found where the elements differ. The implementation, however, must not be dependent on the existence of either the LCP table or the LCP_LR table.
- Parameters:
index1
- the index into the suffix array for the first suffixindex2
- the index into the suffix array for the second suffix- Returns:
- the length of the LCP for these two suffixes
-
lcpLength
public int lcpLength(int index1, int index2) Get the length of the longest common prefix (LCP) for two suffixes. The LCP is a substring, beginning at the start of each suffix, that is shared by two suffixes. Each index is the offset into the suffix array, not the sequence.The algorithm used depends on how this suffix array is configured:
- if this suffix array has an LCP-LR table, an algorithm whose running time is comparable to that for a binary search is used. When the LCP length is small, a direct comparison is faster. The LCP-LR table provides more predictable performance, particularly in cases where long substrings are repeated.
- if there is an LCP table, but no LCP-LR table, the minimum LCP value between two offsets into the suffix array will be computed, but if the difference between the offsets is larger than the LCP value, the two suffixes are compared directly.
- if there is no LCP table and no LCP-LR table, the two suffixes are compared directly. I.e., the suffixes are scanned until a mismatch is found or the end of a suffix is reached.
useLCPLR()
. To configure the suffix array so that the LCP table may be used (e.g., when the LCP-LR table exists), calluseLCP()
.- Parameters:
index1
- the index into the suffix array for the first suffixindex2
- the index into the suffix array for the second suffix- Returns:
- the length of the LCP for these two suffixes
- See Also:
-
getArray
public int[] getArray()Get the suffix array. The array returned will contain the index for each suffix as it would appear in lexical order in which shortest suffixes appear first when the matching elements are identical. The first entry is for an empty string and contains the length of the sequence array.The array returned must not be modified.
- Returns:
- the suffix array
-
subsequences
Get an iterator that will iterator that will provide a sequence of indices and lengths for each subsequence of the sequence corresponding to this suffix array. Each subsequence is unique. An empty subsequence is not included in the iteration.The iterator's
SuffixArray.Iterator.next()
method will return the index into the sequence for the start of a subsequence. If the methodSuffixArray.Iterator.getLength()
is called, the iterator will provide the length of the corresponding subsequence. IfSuffixArray.Iterator.next()
has not been called, the length is undefined, andSuffixArray.Iterator.getLength()
will return 0.- Returns:
- the iterator
-
countUniqueSubsequences
public long countUniqueSubsequences()Count the number of unique subsequences.Note: the implementation (unless overridden) will create an LCP array, temporarily if the
useLCP()
has not been called. If the LCP array will be need subsequently, one should calluseLCP()
before this method is called to avoid duplicating the work needed to set up the LCP array.- Returns:
- the number of subsequences.
-