- Enclosing class:
- SuffixArray
-
Nested Class Summary
Nested classes/interfaces inherited from class org.bzdev.util.SuffixArray
SuffixArray.Array<T>, SuffixArray.Byte, SuffixArray.Char, SuffixArray.Integer, SuffixArray.Iterator, SuffixArray.Range, SuffixArray.Short, SuffixArray.String, SuffixArray.UnsignedByte, SuffixArray.UnsignedShort, SuffixArray.UTF
-
Field Summary
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprotected int
commonPrefixLength
(int index1, int index2) Given two indices into the sequence array, compute the length of a common prefix.protected void
fillLCPArray
(int[] ourlcpArray, int[] rank) Fill the LCP array with the correct values.int
findInstance
(String subsequence) Find the index into the sequence associated with a suffix array for an arbitrary instance of a subsequence.int
findInstance
(String subsequence, int start, int end) Find the index into the sequence associated with a suffix array for an arbitrary instance of a subsequence given a starting and ending index into a string containing the subsequence.Find all instances of a subsequence.Find all instances of a subsequence given a starting index and ending index.int
findSubsequence
(int offset, String subsequence, int start, int end, int first, int last, boolean keyflag) Find a subsequence given a starting index and ending index for the subsequence, restricting the search to a range in this suffix array.int
findSubsequence
(String subsequence) Find the suffix-array index of an arbitrary instance of a subsequence.int
findSubsequence
(String subsequence, boolean keyflag) Find a subsequence.int
findSubsequence
(String subsequence, int start, int end) Find the suffix-array index of an arbitrary instance of a subsequence given a starting index and ending index for the subsequence.int
findSubsequence
(String subsequence, int start, int end, boolean keyflag) Find a subsequence given a starting index and ending index for the subsequence.int
getBWT
(char[] bwt) Get the Burrows-Wheeler Transform (BWT) of the sequence associated with this suffix array, with 0xffff indicating the end-of-text addition to the alphabet.Get the sequence associated with this suffix array.char[]
Get the char array representing this object's sequence.static void
inverseBWT
(char[] bwt, char[] result, int index, int n) Compute the inverse Burrows-Wheeler transform.Methods inherited from class org.bzdev.util.SuffixArray
clearCachedInverse, clearCachedLCP, clearCachedLCPLR, countUniqueSubsequences, getArray, getInverse, getLCP, hasInverse, hasLCP, hasLCPLR, lcpLength, subsequences, useInverse, useLCP, useLCPLR
-
Constructor Details
-
String
Constructor given an alphabet.The alphabet will be mapped to a sequence of integers, starting at 0, and in the order implied by the set's iterator. To force the use of a particular order, use
LinkedHashSet
orTreeSet
.Note: using this constructor will result in allocation a character array whose size is equal to the length of the string provided by the first argument. If the string contains characters not in the alphabet, the alphabet will be automatically extended.
- Parameters:
string
- a string representing a sequence of characters.alphabet
- the characters that are used in the string- Throws:
IllegalArgumentException
- a character in the string was not in the alphabetNullPointerException
- the second argument was null
-
String
Constructor given an alphabet size. The maximum reasonable value of n should be no higher than the length of the sequence and frequently much lower. The value of each character in the string must be in the range [0, n).- Parameters:
string
- a string representing a sequence of characters.n
- the size of an alphabet encoded as values in [0,n) where n is positive and no larger than 0xFFFF (216 - 2)- Throws:
IllegalArgumentException
- n is out of range.
-
String
Constructor given an alphabet size and initial character count. The initial n entries in the alphabet will the integer value of characters whose numeric values are in the set [0,n) in lexical order. If the extend argument is true, the value of n will be increased with additional characters having a value that depends on their order of occurrence in the string.As an example of when setting the extend argument to true is useful is HTML text. If one is primarily interested in searching for markup, one can set the estimated alphabet size to 128 and the alphabet will be extended by one for each additional Unicode character that is used. The integer codes for these additional characters will be set by their order of occurrence in the string.
Note: when the third argument has the value
true
, a char array whose dimensions are the same as that for the string passed as the first argument will be allocated. When the third argument has the valuefalse
, the value of each character in the string passed as the first argument must be in the range [0, n).- Parameters:
string
- a string representing a sequence of characters.n
- the estimated size of an alphabet encoded as values in [0,n) where n is positive and no larger than 0xFFFF (216 - 1)extend
- true if the alphabet should be extended to include all characters in the string; false otherwise- Throws:
IllegalArgumentException
- n is out of range.
-
-
Method Details
-
getSequence
Get the sequence associated with this suffix array.- Returns:
- the sequence that this suffix array describes
-
findInstance
Find the index into the sequence associated with a suffix array for an arbitrary instance of a subsequence.Using an LCP-LR table (created by calling
SuffixArray.useLCPLR()
) will change with time complexity of this method from O(m log n) to O(m + log n), where m is the length of a subsequence and n is the length of the sequence array. These, however, are worst-case numbers: while it can take m steps for a comparison function to determine that two suffixes differ, the comparison will stop at the first step at which the suffixes actually differ: the difference in running time in practice is data-set dependent.If the argument has a length of 0, the value returned is the length of the sequence.
- Parameters:
subsequence
- the subsequence.- Returns:
- the index into the sequence; -1 if the subsequence cannot be found
- See Also:
-
findInstance
Find the index into the sequence associated with a suffix array for an arbitrary instance of a subsequence given a starting and ending index into a string containing the subsequence.Using an LCP-LR table (created by calling
SuffixArray.useLCPLR()
) will change with time complexity of this method from O(m log n) to O(m + log n), where m is the length of a subsequence and n is the length of the sequence array. These, however, are worst-case numbers: while it can take m steps for a comparison function to determine that two suffixes differ, the comparison will stop at the first step at which the suffixes actually differ: the difference in running time in practice is data-set dependent.If the argument has a length of 0, the value returned is the length of the sequence. This is also true if the second and third arguments are equal.
- Parameters:
subsequence
- array containing the subsequence.start
- the starting index in the subsequence array (inclusive)end
- the ending index in the subsequence array (exclusive)- Returns:
- the index into the sequence; -1 if the subsequence cannot be found
- See Also:
-
findSubsequence
Find the suffix-array index of an arbitrary instance of a subsequence. This method finds the index into a suffix array corresponding to a subsequence.- Parameters:
subsequence
- the subsequence.- Returns:
- the index into the suffix array; -1 if the subsequence cannot be found
- See Also:
-
findSubsequence
Find the suffix-array index of an arbitrary instance of a subsequence given a starting index and ending index for the subsequence. This method finds the index into a suffix array corresponding to a subsequence. The subsequence consists of the elements of the string with a starting index named start and and ending index named end.- Parameters:
subsequence
- the string containing the subsequencestart
- the starting index in the subsequence string (inclusive)end
- the ending index in the subsequence string (exclusive)- Returns:
- the index into the suffix array; -1 if the subsequence cannot be found
- See Also:
-
getSequenceArray
public char[] getSequenceArray()Get the char array representing this object's sequence. The array that is returned must not be modified.If an explicit alphabet is not used or the if last argument of the three-argument constructor has a value of
false
, calling this argument will allocate an array that will improve performance slightly .- Returns:
- the array of characters for the string used to create this suffix array
-
findSubsequence
Find a subsequence. This method finds the index into a suffix array corresponding to a subsequence.- Parameters:
subsequence
- the subsequence.keyflag
- true if the highest index should be returned; false if the lowest index should be returned.- Returns:
- the index into the suffix array; -1 if the subsequence cannot be found
- See Also:
-
findSubsequence
Find a subsequence given a starting index and ending index for the subsequence. This method finds the index into a suffix array corresponding to a subsequence. The subsequence consists of the elements of the array sarray with a starting index named start and and ending index named end.- Parameters:
subsequence
- the subsequencestart
- the starting index in the subsequence array (inclusive)end
- the ending index in the subsequence array (exclusive)keyflag
- true if the highest index should be returned; false if the lowest index should be returned.- Returns:
- the index into the suffix array; -1 if the subsequence cannot be found
- See Also:
-
findSubsequence
public int findSubsequence(int offset, String subsequence, int start, int end, int first, int last, boolean keyflag) Find a subsequence given a starting index and ending index for the subsequence, restricting the search to a range in this suffix array. This method finds the index into a suffix array corresponding to a subsequence. The subsequence consists of the elements of the array sarray with a starting index named start and and ending index named end. When the range provided by the fifth and sixth arguments is such that all suffixes in this range start with the same N symbols, the offset can be set to N so that these symbols are ignored in comparisions, in which case the third argument should also be set so that the subsequence does not contain these N symbols at its start.- Parameters:
offset
- an offset to add to the index into a sequence provided by the suffix array when comparing valuessubsequence
- the subsequencestart
- the starting index in the subsequence array (inclusive)end
- the ending index in the subsequence array (exclusive)first
- the starting index in the suffix array (inclusive)last
- the ending index int he suffix array (exclusive)keyflag
- true if the highest index should be returned; false if the lowest index should be returned.- Returns:
- the index into the suffix array; -1 if the subsequence cannot be found
- See Also:
-
findRange
Find all instances of a subsequence.- Parameters:
subsequence
- the subsequence- Returns:
- the subsequences corresponding to a range in the suffix array
- See Also:
-
findRange
Find all instances of a subsequence given a starting index and ending index.- Parameters:
subsequence
- the subsequence array.start
- the starting index in the subsequence array (inclusive)end
- the ending index in the subsequence array (exclusive)- Returns:
- the subsequences corresponding to a range in the suffix array
- See Also:
-
getBWT
public int getBWT(char[] bwt) Get the Burrows-Wheeler Transform (BWT) of the sequence associated with this suffix array, with 0xffff indicating the end-of-text addition to the alphabet. The value returned includes the end-of-text symbol in the transform when the length of the array is one more than the length of the sequence associated with this suffix array.- Parameters:
bwt
- the array to store the BWT (an array whose length is the length of the sequence if the end-of-text symbol does not appear in the BWT and one more than the length of the sequence if the end-of-text symbol does appear in the BWT)- Returns:
- the index for the sorted permutation that matches the sequence
-
inverseBWT
public static void inverseBWT(char[] bwt, char[] result, int index, int n) throws IllegalArgumentException Compute the inverse Burrows-Wheeler transform. When the length of of the BWT array is one more than the length of the result array, the BTW array is assumed to contain an end-of-text symbol (-1 for this case), and the index parameter is ignored. If the two arrays have the same length, all symbols in the BWT array must be in the alphabet and the index must be provided (it will be the value returned by a call togetBWT(char[])
).- Parameters:
bwt
- the Burrows-Wheeler transformresult
- the inverse of the Burrons-Wheeler transformindex
- the index parameter for the Burrows-Wheeler transformn
- the size of the alphabet.- Throws:
IllegalArgumentException
- bwt and result have inconsistent lengths
-
fillLCPArray
protected void fillLCPArray(int[] ourlcpArray, int[] rank) Description copied from class:SuffixArray
Fill the LCP array with the correct values. (the value at index 0 will be set outside of this method).A typical implementation would use the following code:
Subclasses will have different types for the variableprotected void fillLCPArray(int[] ourlcpArray, int[] rank) { int k = 0; int n = sequence.length; for (int i = 0; i < n; i++) { if (rank[i] == n) { k = 0; continue; } int j = array[rank[i]+1]; while (i+k < n && j+k < n && sequence[i+k] == sequence[j+k]) { k++; } ourlcpArray[rank[i]+1] = k; if (k > 0) k--; } }
sequence
.- Specified by:
fillLCPArray
in classSuffixArray
- Parameters:
ourlcpArray
- the LCP array to fillrank
- the inverse mapping for this suffix array
-
commonPrefixLength
protected int commonPrefixLength(int index1, int index2) Description copied from class:SuffixArray
Given two indices into the sequence array, compute the length of a common prefix.This method is abstract because the implementation is dependent on the type of sequence-array elements. It's implementation will typically compare the prefixes element by element until an index is found where the elements differ. The implementation, however, must not be dependent on the existence of either the LCP table or the LCP_LR table.
- Specified by:
commonPrefixLength
in classSuffixArray
- Parameters:
index1
- the index into the suffix array for the first suffixindex2
- the index into the suffix array for the second suffix- Returns:
- the length of the LCP for these two suffixes
-