Class SuffixArray.Integer

java.lang.Object
org.bzdev.util.SuffixArray
org.bzdev.util.SuffixArray.Integer
Enclosing class:
SuffixArray

public static final class SuffixArray.Integer extends SuffixArray
Class providing a suffix array for int-valued sequences
  • Field Details

    • sequence

      protected int[] sequence
      The sequence used to compute this suffix array. It must not be modified after being set in a constructor.
  • Constructor Details

    • Integer

      public Integer(int[] sequence, int n)
      Constructor. The maximum reasonable value of n should be no higher than the length of the sequence and frequently much lower. The sequence must not be changed after this constructor is called as long as this suffix array is used.
      Parameters:
      sequence - a sequence of integers whose values are in the range [0,n)
      n - the size of an alphabet encoded as values in [0,n) where n must be a positive integer
      Throws:
      IllegalArgumentException - n is out of range.
    • Integer

      public Integer(int[] sequence, int[] sarray) throws IllegalArgumentException
      Constructor for precomputed suffix arrays. The maximum reasonable value of n should be no higher than the length of the sequence and frequently much lower. The sequence must not be changed after this constructor is called as long as this suffix array is used.

      The suffix array passed as the third argument must have one more element than the sequence.

      This constructor is intended for the case where a suffix array was previously computed and saved in a file. The caller must assure that the suffix array was computed for the sequence and that the alphabets are the same.

      Parameters:
      sequence - a sequence of integers whose values are in the range [0,n)
      sarray - the suffix array
      Throws:
      IllegalArgumentException - n is out of range or the suffix array is not compatible with the sequence.
  • Method Details

    • findInstance

      public int findInstance(int[] subsequence)
      Find the index into the sequence associated with a suffix array for an arbitrary instance of a subsequence.

      Using an LCP-LR table (created by calling SuffixArray.useLCPLR()) will change with time complexity of this method from O(m log n) to O(m + log n), where m is the length of a subsequence and n is the length of the sequence array. These, however, are worst-case numbers: while it can take m steps for a comparison function to determine that two suffixes differ, the comparison will stop at the first step at which the suffixes actually differ: the difference in running time in practice is data-set dependent.

      If the argument has a length of 0, the value returned is the length of the sequence.

      Parameters:
      subsequence - the subsequence.
      Returns:
      the index into the sequence; -1 if the subsequence cannot be found
      See Also:
    • findInstance

      public int findInstance(int[] subsequence, int start, int end)
      Find the index into the sequence associated with a suffix array for an arbitrary instance of a subsequence given a starting and ending index into an array containing the subsequence.

      Using an LCP-LR table (created by calling SuffixArray.useLCPLR()) will change with time complexity of this method from O(m log n) to O(m + log n), where m is the length of a subsequence and n is the length of the sequence array. These, however, are worst-case numbers: while it can take m steps for a comparison function to determine that two suffixes differ, the comparison will stop at the first step at which the suffixes actually differ: the difference in running time in practice is data-set dependent.

      If the argument has a length of 0, or the second and third arguments are equal, the value returned is the length of the sequence.

      Parameters:
      subsequence - array containing the subsequence.
      start - the starting index in the subsequence array (inclusive)
      end - the ending index in the subsequence array (exclusive)
      Returns:
      the index into the sequence; -1 if the subsequence cannot be found
      See Also:
    • findSubsequence

      public int findSubsequence(int[] subsequence)
      Find the suffix-array index of an arbitrary instance of a subsequence. This method finds the index into a suffix array corresponding to a subsequence.
      Parameters:
      subsequence - the subsequence.
      Returns:
      the index into the suffix array; -1 if the subsequence cannot be found
      See Also:
    • findSubsequence

      public int findSubsequence(int[] sarray, int start, int end)
      Find the suffix-array index of an arbitrary instance of a subsequence given a starting index and ending index for the subsequence. This method finds the index into a suffix array corresponding to a subsequence. The subsequence consists of the elements of the array sarray with a starting index named start and and ending index named end.
      Parameters:
      sarray - the subsequence array.
      start - the starting index in the subsequence array (inclusive)
      end - the ending index in the subsequence array (exclusive)
      Returns:
      the index into the suffix array; -1 if the subsequence cannot be found
      See Also:
    • findSubsequence

      public int findSubsequence(int[] subsequence, boolean keyflag)
      Find a subsequence. This method finds the index into a suffix array corresponding to a subsequence.
      Parameters:
      subsequence - the subsequence.
      keyflag - true if the highest index should be returned; false if the lowest index should be returned.
      Returns:
      the index into the suffix array; -1 if the subsequence cannot be found
      See Also:
    • findSubsequence

      public int findSubsequence(int[] sarray, int start, int end, boolean keyflag)
      Find a subsequence given a starting index and ending index for the subsequence. This method finds the index into a suffix array corresponding to a subsequence. The subsequence consists of the elements of the array sarray with a starting index named start and and ending index named end.
      Parameters:
      sarray - the subsequence array.
      start - the starting index in the subsequence array (inclusive)
      end - the ending index in the subsequence array (exclusive)
      keyflag - true if the highest index should be returned; false if the lowest index should be returned.
      Returns:
      the index into the suffix array; -1 if the subsequence cannot be found
      See Also:
    • findSubsequence

      public int findSubsequence(int offset, int[] sarray, int start, int end, int first, int last, boolean keyflag)
      Find a subsequence given a starting index and ending index for the subsequence, restricting the search to a range in this suffix array. This method finds the index into a suffix array corresponding to a subsequence. The subsequence consists of the elements of the array sarray with a starting index named start and and ending index named end. When the range provided by the fifth and sixth arguments is such that all suffixes in this range start with the same N symbols, the offset can be set to N so that these symbols are ignored in comparisions, in which case the third argument should also be set so that the subsequence does not contain these N symbols at its start.
      Parameters:
      offset - an offset to add to the index into a sequence provided by the suffix array when comparing values
      sarray - the subsequence array
      start - the starting index in the subsequence array (inclusive)
      end - the ending index in the subsequence array (exclusive)
      first - the starting index in the suffix array (inclusive)
      last - the ending index int he suffix array (exclusive)
      keyflag - true if the highest index should be returned; false if the lowest index should be returned.
      Returns:
      the index into the suffix array; -1 if the subsequence cannot be found
      See Also:
    • findRange

      public SuffixArray.Range findRange(int[] subsequence)
      Find all instances of a subsequence.
      Parameters:
      subsequence - the subsequence
      Returns:
      the subsequences corresponding to a range in the suffix array
      See Also:
    • findRange

      public SuffixArray.Range findRange(int[] sarray, int start, int end)
      Find all instances of a subsequence given a starting index and ending index.
      Parameters:
      sarray - the subsequence array.
      start - the starting index in the subsequence array (inclusive)
      end - the ending index in the subsequence array (exclusive)
      Returns:
      the subsequences corresponding to a range in the suffix array
      See Also:
    • getSequence

      public int[] getSequence()
      Get the sequence associated with this suffix array. The sequence is an array that must not be modified.
      Returns:
      the sequence that this suffix array describes
    • getBWT

      public int getBWT(int[] bwt)
      Get the Burrows-Wheeler Transform (BWT) of the sequence associated with this suffix array, with -1 indicating the end-of-text addition to the alphabet. The value returned includes the end-of-text symbol in the transform when the length of the array is one more than the length of the sequence associated with this suffix array.
      Parameters:
      bwt - the array to store the BWT (an array whose length is the length of the sequence if the end-of-text symbol does not appear in the BWT and one more than the length of the sequence if the end-of-text symbol does appear in the BWT)
      Returns:
      the index for the sorted permutation that matches the sequence
    • inverseBWT

      public static void inverseBWT(int[] bwt, int[] result, int index, int n) throws IllegalArgumentException
      Compute the inverse Burrows-Wheeler transform. When the length of of the BWT array is one more than the length of the result array, the BTW array is assumed to contain an end-of-text symbol (-1 for this case), and the index parameter is ignored. If the two arrays have the same length, all symbols in the BWT array must be in the alphabet and the index must be provided (it will be the value returned by a call to getBWT(int[])).
      Parameters:
      bwt - the Burrows-Wheeler transform
      result - the inverse of the Burrons-Wheeler transform
      index - the index parameter for the Burrows-Wheeler transform
      n - the size of the alphabet.
      Throws:
      IllegalArgumentException - bwt and result have inconsistent lengths
    • fillLCPArray

      protected void fillLCPArray(int[] ourlcpArray, int[] rank)
      Description copied from class: SuffixArray
      Fill the LCP array with the correct values. (the value at index 0 will be set outside of this method).

      A typical implementation would use the following code:

      
        protected void fillLCPArray(int[] ourlcpArray, int[] rank) {
            int k = 0;
            int n = sequence.length;
                  for (int i = 0; i < n; i++) {
                      if (rank[i] == n) {
                          k = 0;
                          continue;
                      }
                      int j = array[rank[i]+1];
                      while (i+k < n && j+k < n &&
                          sequence[i+k] == sequence[j+k]) {
                          k++;
                      }
                      ourlcpArray[rank[i]+1] = k;
                      if (k > 0) k--;
            }
              }
       
      Subclasses will have different types for the variable sequence.
      Specified by:
      fillLCPArray in class SuffixArray
      Parameters:
      ourlcpArray - the LCP array to fill
      rank - the inverse mapping for this suffix array
    • commonPrefixLength

      protected int commonPrefixLength(int index1, int index2)
      Description copied from class: SuffixArray
      Given two indices into the sequence array, compute the length of a common prefix.

      This method is abstract because the implementation is dependent on the type of sequence-array elements. It's implementation will typically compare the prefixes element by element until an index is found where the elements differ. The implementation, however, must not be dependent on the existence of either the LCP table or the LCP_LR table.

      Specified by:
      commonPrefixLength in class SuffixArray
      Parameters:
      index1 - the index into the suffix array for the first suffix
      index2 - the index into the suffix array for the second suffix
      Returns:
      the length of the LCP for these two suffixes