Class SuffixArray

java.lang.Object
org.bzdev.util.SuffixArray
Direct Known Subclasses:
SuffixArray.Array, SuffixArray.Byte, SuffixArray.Char, SuffixArray.Integer, SuffixArray.Short, SuffixArray.String, SuffixArray.UnsignedByte, SuffixArray.UnsignedShort

public abstract class SuffixArray extends Object
Suffix Array class. Suppose a sequence is represented by an array of integers with each element the member of an alphabet consisting of the values [0, n) for a fixed positive integer n. The alphabet can be augmented with a sentinel character that is lexically smaller than any member of the alphabet (the choice made for this class is -1 for most types and 0xFFFF for char, which is an unsigned type. The Unicode value 0xFFFF is interpreted as "not a character" and thus should not appear in any reasonable text). A suffix is a subsequence that starts at some index and continues to the end of the sequence.

A suffix array is an array of indices that indicate the index for the first element of each suffix, sorted into ascending lexical order. the first element represents an empty sequence, and its value is the length of the array (the element it points to is assumed to be the sentinel character).

Suffix arrays have numerous applications including string searching (e.g., exact matches and common substrings), and computational biology. Because of the importance of this data structure, a considerable amount of work has gone into finding efficient algorithms for generating suffix arrays. The following provides a list of citations for some of this research:

The coding examples and algorithm descriptions cited above cannot be used literally for the LCP computation because they make differing assumptions as to the format of the suffix array, specifically its first entry. Some authors start a suffix array with an index for a sentinel character that may not be literally present in the string: this character is considered not be part of the alphabet and is lexically smaller than each 'letter' in the alphabet. In addition, some descriptions of the algorithms start an array index at 1 while others start it at 0.

The SuffixArray class is an abstract class, with several subclasses defined as nested classes:

  • SuffixArray.Integer. The sequence is an array of integers.
  • SuffixArray.Char. The sequence is an array of characters.
  • SuffixArray.Short. The sequence is an array of short integers.
  • SuffixArray.UnsignedShort. The sequence is an array of short integers, treated as unsigned numbers(for a short h, the corresponding unsigned value is 0xFFFF & h).
  • SuffixArray.Byte. The sequence is an array of bytes.
  • SuffixArray.UnsignedByte. The sequence is an array of bytes, treated as unsigned numbers (for a byte b, the corresponding unsigned value is 0xFF & b).
  • SuffixArray.String. The sequence is a string, If an alphabet is provided, its type must be Set<Character>.
  • SuffixArray.Array. The sequence is an array of objects of type T, where T is a type parameter. An alphabet is required and its type must be Set<T>.
When an explicit alphabet is used it will be represented as a Set, and will be mapped to a sequence of integers, starting at 0, and in the order implied by the set's iterator. To force the use of a particular order, use LinkedHashSet or TreeSet. Regardless of how the alphabet is defined, the algorithm used to construct the suffix array allocates and uses arrays whose sizes are equal to the size of the alphabet, so keeping the alphabet to the minimum size needed is advantageous.

In this package, each subclass of SuffixArray has methods that return the sequence associated with the suffix array. These methods are not provided by the SuffixArray class itself because the type of the value returned differs between subclasses.

The methods getArray() and getLCP() can be used to get arrays that can be used to simulate the traversal of a suffix tree. The method getInverse() allows one to map a suffix (represented as an index into the sequence array or string) into the corresponding index in the suffix array.

Subclasses provide additional methods in cases where a method's signature is dependent on types (especially primitive types) associated with that subclass. These classes to not have corresponding abstract methods in SuffixArray. For the current package, several such methods are provided by all the subclasses of SuffixArray:

  • getSequence. This method will return the sequence used to create a suffix array.
  • getBWT. This method will compute the Burrows-Wheeler transform for this suffix array. Aside from its use in data compression, some search algorithms use this transform.
  • findInstance. This method (there are two variants) will find an offset into the sequence array where the offset matches the specified subsequence. This is a convenience method that uses findSubsequence to look up an index in the suffix array, but returns the corresponding index into the sequence instead.
  • findRange. This method (there are two variants) will find a a range of offsets into the sequence array where each offset matches the specified subsequence.
  • findSubsequence. This method (there are three variants) allows instances of a subsequence to be found efficiently. For the case without a boolean argument, where an arbitrary instance of the subsequence is returned, the time complexity depends on whether or not LCP arrays are used: if n is the length of the sequence and m is the length of a subsequence, the time complexity is O(m log n) if the LCP-LR auxiliary data structure is not used and is O(m + log n) if the LCP-LR auxiliary data structure is used. One variant allows one to search just a portion of a suffix array. For example, when searching for a number of different HTML tags, one can first search for '<' to find the indicies in the suffix array for all suffixes that start with '<' and then search that subrange for the remainder of the tag, which can improve performance as the subsequent binary searches will be faster.
  • inverseBWT. This method (which is a static method) computes the inverse Burrows-Wheeler transform.

Auxiliary tables

The implementation supports three auxiliary tables:

  • An inverse table. This table is implemented as an array with the same length as the suffix array. A copy of this table can be created by calling getInverse(). To create this table and cache it, call useInverse(). To remove the cached table, call clearCachedInverse(). To check if a cached table exists, call hasInverse().
  • an LCP (Longest Common Prefix) table. This table is implemented as an array with the same length as the suffix array. A copy of this table can be created by calling getLCP(). To create this table and cache it, call useLCP(). To remove the cached table, call clearCachedLCP(). To check if a cached LCP table exists, call hasLCP(). Creating the LCP table requires the inverse table, so if the index table will be used after the LCP table is created, call useInverse() before calling useLCP() or getLCP().
  • an LCP-LR table. This contains a left and right LCP table for a binary search case and is used to speed up some computations. The table consists of two protected fields, both arrays and each as large as the corresponding suffix array. To create this table, call useLCPLR(). To remove the table, call clearCachedLCPLR(). To check if the table exists, call hasLCPLR().
One should consider removing the inverse, LCP, and LCP-LR auxiliary tables when they are no longer needed, particularly if the suffix array is large, given the size of these tables.
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static final class 
    Class providing a suffix array for Object sequences.
    static final class 
    Class providing a suffix array for byte-valued sequences.
    static final class 
    Class providing a suffix array for char-valued sequences
    static final class 
    Class providing a suffix array for int-valued sequences
    static class 
    Iterator for a suffix array.
    static interface 
    Interface representing a range in a suffix array resulting from a search for all instances of a subsequence.
    static final class 
    Class providing a suffix array for short-valued sequences
    static final class 
    Class providing a suffix array for String sequences
    static class 
    Class providing a suffix array for unsigned-byte-valued sequences Java bytes are signed.
    static final class 
    Class providing a suffix array for short-valued sequences
    static final class 
    Class providing a suffix array for byte arrays containing UTF-8 encoded characters.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected int[]
    The suffix array.
    protected int[]
    The LCP_L table.
    protected int[]
    The LCP_R table.
    protected int[]
    The Longest Common Prefix (LCP) length array.
    protected int[]
    The inverse of the suffix array.
    protected int
    One higher than the length of the sequence array.
  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    Clear the cached inverse mapping.
    void
    Clear the cached least-common-prefix array.
    void
    Remove the cached LCP-LR table, if any, associated with this suffix array.
    protected abstract int
    commonPrefixLength(int index1, int index2)
    Given two indices into the sequence array, compute the length of a common prefix.
    long
    Count the number of unique subsequences.
    protected abstract void
    fillLCPArray(int[] ourlcpArray, int[] rank)
    Fill the LCP array with the correct values.
    int[]
    Get the suffix array.
    int[]
    Get the inverse mapping for this suffix array.
    int[]
    Get the LCP array corresponding to this suffix array.
    boolean
    Test if the inverse array is currently cached.
    boolean
    Test if the LCP array is currently cached.
    boolean
    Determine if this suffix array has an LCP-LR table associated with it.
    int
    lcpLength(int index1, int index2)
    Get the length of the longest common prefix (LCP) for two suffixes.
    Get an iterator that will iterator that will provide a sequence of indices and lengths for each subsequence of the sequence corresponding to this suffix array.
    void
    Configure this suffix array so that it has an inverse mapping.
    void
    Configure this suffix array to use an LCP (Longest Common Prefix) table.
    void
    Configure a suffix array to use an LCP-LR table.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • slenp1

      protected int slenp1
      One higher than the length of the sequence array. Subclasses must set this field.
    • array

      protected int[] array
      The suffix array. Subclasses must set this field.
    • lcpArray

      protected int[] lcpArray
      The Longest Common Prefix (LCP) length array. Subclasses may read this field but should not set it or alter it.
    • rank

      protected int[] rank
      The inverse of the suffix array. Subclasses may read this.
    • LCP_L

      protected int[] LCP_L
      The LCP_L table.
      See Also:
    • LCP_R

      protected int[] LCP_R
      The LCP_R table.
      See Also:
  • Constructor Details

    • SuffixArray

      public SuffixArray()
  • Method Details

    • hasInverse

      public boolean hasInverse()
      Test if the inverse array is currently cached.
      Returns:
      true if the inverse array is cached; false if it has to be recomputed
    • useInverse

      public void useInverse()
      Configure this suffix array so that it has an inverse mapping. This inverse mapping maps the offset into a sequence to the corresponding offset into the suffix array. Calling this method will result in the inverse table being cached. To clear the cached inverse table, call clearCachedInverse()
    • getInverse

      public int[] getInverse()
      Get the inverse mapping for this suffix array. The inverse mapping is an array whose index is the index in the sequence array and whose value is the corresponding index in the suffix array.
      Returns:
      the inverse mapping
    • clearCachedInverse

      public void clearCachedInverse()
      Clear the cached inverse mapping. This can be called to reduce memory usage if all external references to the mapping have been removed. The mapping can be restored if needed, but the time complexity is O(n) where n is the length of the sequence.
    • clearCachedLCP

      public void clearCachedLCP()
      Clear the cached least-common-prefix array. This can be called to reduce memory usage if all external references to the LCP length array have been removed. The array can be restored if needed, but the time complexity is O(n) where n is the length of the sequence.
    • fillLCPArray

      protected abstract void fillLCPArray(int[] ourlcpArray, int[] rank)
      Fill the LCP array with the correct values. (the value at index 0 will be set outside of this method).

      A typical implementation would use the following code:

      
        protected void fillLCPArray(int[] ourlcpArray, int[] rank) {
            int k = 0;
            int n = sequence.length;
                  for (int i = 0; i < n; i++) {
                      if (rank[i] == n) {
                          k = 0;
                          continue;
                      }
                      int j = array[rank[i]+1];
                      while (i+k < n && j+k < n &&
                          sequence[i+k] == sequence[j+k]) {
                          k++;
                      }
                      ourlcpArray[rank[i]+1] = k;
                      if (k > 0) k--;
            }
              }
       
      Subclasses will have different types for the variable sequence.
      Parameters:
      ourlcpArray - the LCP array to fill
      rank - the inverse mapping for this suffix array
    • hasLCP

      public boolean hasLCP()
      Test if the LCP array is currently cached.
      Returns:
      true if the LCP array is cached; false if it has to be recomputed
    • useLCP

      public void useLCP()
      Configure this suffix array to use an LCP (Longest Common Prefix) table. Calling this method will cause the LCP table to be cached. To remove the cached table, call clearCachedLCP().
    • getLCP

      public int[] getLCP()
      Get the LCP array corresponding to this suffix array. An LCP (Longest Common Prefix) array provides the length of the longest common prefix for the subsequence represented by an index i in a suffix array and the subsequence at index (i-1). A value of -1 indicates that no value exists (this will be true at index 0, as there is no subsequence at index -1).

      NOTE: the algorithm used to create the LCP table requires the inverse array (returned by getInverse(). If the inverse mapping will be used later, one should call useInverse() before this method is called to avoid creating the inverse mapping multiple times.

    • hasLCPLR

      public boolean hasLCPLR()
      Determine if this suffix array has an LCP-LR table associated with it.
      Returns:
      true if there is an LCP-LR table; false otherwise
    • clearCachedLCPLR

      public void clearCachedLCPLR()
      Remove the cached LCP-LR table, if any, associated with this suffix array. This can be called to reduce memory usage. The table can be restored if needed, but the time complexity is O(n) where n is the length of the sequence.
    • useLCPLR

      public void useLCPLR()
      Configure a suffix array to use an LCP-LR table. This table contains two entries per offset into the suffix array. The 'left' entry is the minimum LCP value for a range of values lower than the offset, and the 'right' entry is the minimum LCP value for a range of values higher than the offset. The length of these ranges are such that the offset will be the mid point between a 'low' and 'high' value during a binary search.

      This table is implemented as two integer arrays, each the same size as an LCP array. For long sequences, a significant amount of memory will be required.

      Calling this method will result in the LCP-LR table being cached. To clear the cached table, call clearCachedLCPLR(). In addition, this method uses an LCP array. The method useLCP() should be called before a call to this method if the LCP array will be used later: otherwise the LCP array will be recomputed.

    • commonPrefixLength

      protected abstract int commonPrefixLength(int index1, int index2)
      Given two indices into the sequence array, compute the length of a common prefix.

      This method is abstract because the implementation is dependent on the type of sequence-array elements. It's implementation will typically compare the prefixes element by element until an index is found where the elements differ. The implementation, however, must not be dependent on the existence of either the LCP table or the LCP_LR table.

      Parameters:
      index1 - the index into the suffix array for the first suffix
      index2 - the index into the suffix array for the second suffix
      Returns:
      the length of the LCP for these two suffixes
    • lcpLength

      public int lcpLength(int index1, int index2)
      Get the length of the longest common prefix (LCP) for two suffixes. The LCP is a substring, beginning at the start of each suffix, that is shared by two suffixes. Each index is the offset into the suffix array, not the sequence.

      The algorithm used depends on how this suffix array is configured:

      • if this suffix array has an LCP-LR table, an algorithm whose running time is comparable to that for a binary search is used. When the LCP length is small, a direct comparison is faster. The LCP-LR table provides more predictable performance, particularly in cases where long substrings are repeated.
      • if there is an LCP table, but no LCP-LR table, the minimum LCP value between two offsets into the suffix array will be computed, but if the difference between the offsets is larger than the LCP value, the two suffixes are compared directly.
      • if there is no LCP table and no LCP-LR table, the two suffixes are compared directly. I.e., the suffixes are scanned until a mismatch is found or the end of a suffix is reached.
      To configure the suffix array so that the LCP-LR table is used, call useLCPLR(). To configure the suffix array so that the LCP table may be used (e.g., when the LCP-LR table exists), call useLCP().
      Parameters:
      index1 - the index into the suffix array for the first suffix
      index2 - the index into the suffix array for the second suffix
      Returns:
      the length of the LCP for these two suffixes
      See Also:
    • getArray

      public int[] getArray()
      Get the suffix array. The array returned will contain the index for each suffix as it would appear in lexical order in which shortest suffixes appear first when the matching elements are identical. The first entry is for an empty string and contains the length of the sequence array.

      The array returned must not be modified.

      Returns:
      the suffix array
    • subsequences

      public SuffixArray.Iterator subsequences()
      Get an iterator that will iterator that will provide a sequence of indices and lengths for each subsequence of the sequence corresponding to this suffix array. Each subsequence is unique. An empty subsequence is not included in the iteration.

      The iterator's SuffixArray.Iterator.next() method will return the index into the sequence for the start of a subsequence. If the method SuffixArray.Iterator.getLength() is called, the iterator will provide the length of the corresponding subsequence. If SuffixArray.Iterator.next() has not been called, the length is undefined, and SuffixArray.Iterator.getLength() will return 0.

      Returns:
      the iterator
    • countUniqueSubsequences

      public long countUniqueSubsequences()
      Count the number of unique subsequences.

      Note: the implementation (unless overridden) will create an LCP array, temporarily if the useLCP() has not been called. If the LCP array will be need subsequently, one should call useLCP() before this method is called to avoid duplicating the work needed to set up the LCP array.

      Returns:
      the number of subsequences.