Class ACMatcher

java.lang.Object
org.bzdev.util.ACMatcher

public class ACMatcher extends Object
This class provides an implementation of the Aho Corasick string-matching algorithm.

If there are a number of predetermined patterns (substrings) to search for, ACMatcher will outperform the same search using a SuffixArray. If multiple searches are performed on the same string, with the pattern or substring not known in advance, a SuffixArray is a better choice.

The implementation is based on one contributed to the Geeks for Geeks website by Princi Singh as an example of an Aho Corasick algorithm implementation. The modifications included using a BitSet instead of an integer (which restricted the number of patterns to 32), a dynamically configured alphabet, and methods to provide iterators and streams. The implementation in this package can be configured to be either case sensitive or case insensitive. Finally, the sequence being searched can be represented as either a string or an array of characters, and the search can be restricted to a subsequence of either.

The patterns to search for are passed to a constructor. The methods used to search are named

  • iterator. This provides an Iterator.
  • iterableOver. This provides an Iterable for use in a 'for' loop.
  • stream. This provides a Stream.
      These are overloaded so that the text being searched can be either a string or a character array, and so that all the text can be searched or just a subsequence of the text.
  • Constructor Details

    • ACMatcher

      public ACMatcher(String first, String... rest)
      Constructor using a variable number of arguments. The search is case sensitive.
      Parameters:
      first - the first search string
      rest - the remaining search strings
    • ACMatcher

      public ACMatcher(boolean ignoreCase, String first, String... rest)
      Constructor using a variable number of arguments and indicating if the search is case sensitive or case insensitive.
      Parameters:
      ignoreCase - true if the search is case insensitive; false if the search is case sensitive
      first - the first search string
      rest - the remaining search strings
    • ACMatcher

      public ACMatcher(Function<T,String> f, T[] specs)
      Constructor using an array of pattern specifications. A function maps each pattern specification to the corresponding pattern. This can be used to associate each pattern with a enumeration, which can make the use of a switch statement more reliable when new cases are added. For example,
      
       static enum SpecType {
            TYPE1,
            TYPE2,
            ...
       }
       static class Spec {
         SpecType type
         String pattern;
         public Spec(specType type, string pattern) {
             this.type = type;
             this.pattern = pattern;
       }
       ...
         Spec specs[] = {
             new Spec(SpecType.TYPE1, "foo"),
             new Spec(SpecType.TYPE2, "bar"),
             ...
         };
         ACMatcher matcher = new
           ACMatcher((spec) -> {return spec.pattern;}, specs);
         for (ACMatcher.MatchResult mr: matcher.interatorOver(text)) {
             int index = mr.getIndex();
             switch (spec[index].type) {
             case TYPE1:
                ...
             }
         }
       
      Type Parameters:
      T - the type of the objects used as a pattern specifications
      Parameters:
      f - a function that maps a pattern specification to a pattern
      specs - an array containing pattern specifications
    • ACMatcher

      public ACMatcher(boolean ignoreCase, Function<T,String> f, T[] specs)
      Constructor using an array of pattern specifications and specifying if the matcher is case sensitive. A function maps each pattern specification to the corresponding pattern. This can be used to associate each pattern with a enumeration, which can make the use of a switch statement more reliable when new cases are added. For example,
      
       static enum SpecType {
            TYPE1,
            TYPE2,
            ...
       }
       static class Spec {
         SpecType type
         String pattern;
         public Spec(specType type, string pattern) {
             this.type = type;
             this.pattern = pattern;
       }
       ...
         Spec specs[] = {
             new Spec(SpecType.TYPE1, "foo"),
             new Spec(SpecType.TYPE2, "bar"),
             ...
         };
         ACMatcher matcher = new
           ACMatcher((spec) -> {return spec.pattern;}, specs);
         for (ACMatcher.MatchResult mr: matcher.interatorOver(text)) {
             int index = mr.getIndex();
             switch (spec[index].type) {
             case TYPE1:
                ...
             }
         }
       
      Type Parameters:
      T - the type of the objects used as a pattern specifications
      Parameters:
      ignoreCase - true if the search is case insensitive; false if the search is case sensitive
      f - a function that maps a pattern specification to a pattern
      specs - an array containing pattern specifications
    • ACMatcher

      public ACMatcher(String[] strings)
      Constructor. The search is case sensitive.
      Parameters:
      strings - the strings to match
    • ACMatcher

      public ACMatcher(boolean ignoreCase, String[] strings)
      Constructor specifying if the matcher is case sensitive or case insensitive.
      Parameters:
      ignoreCase - true if the search is case insensitive; false if the search is case sensitive
      strings - the strings to match
  • Method Details

    • size

      public int size()
      Get the number of patterns for this matcher.
      Returns:
      the number of patterns
    • getPatterns

      public String[] getPatterns()
      Get the patterns (search strings) for this matcher. If the search is case insensitive, the patterns will use lower case, regardless of the cases used in a constructor.
      Returns:
      the patterns
    • iterableOver

      public Iterable<ACMatcher.MatchResult> iterableOver(String text)
      Get an Iterable that searches the given text, provided as a string, for the patterns used to configure this matcher.

      This is a convenience method, added to allow an instance of this class to be used in a 'for' loop:

      
       String patterns[] = {...};
       ACMatcher matcher = new ACMatcher(patterns);
       String text = "...";
       for (ACMatcher.MatchResult mr: matcher.iterableOver(text)) {
          ...
       }
       
      Parameters:
      text - the text to search
      Returns:
      the iterable
    • iterableOver

      public Iterable<ACMatcher.MatchResult> iterableOver(String text, int start, int end)
      Get an Iterable that starch's a portion of the given text, provided as a string, for the patterns used to configure this matcher.

      This is a convenience method, added to allow an instance of this class to be used in a 'for' loop:

      
       String patterns[] = {...};
       ACMatcher matcher = new ACMatcher(patterns);
       String text = "...";
       int start = ...;
       int end = ...;
       for (ACMatcher.MatchResult mr: matcher.iterableOver(text, start, end)) {
          ...
       }
       
      Parameters:
      text - the text to search
      start - the starting index (inclusive) for the text being scanned
      end - the ending index (exclusive) for the text being scanned
      Returns:
      the iterable
    • iterableOver

      public Iterable<ACMatcher.MatchResult> iterableOver(char[] text)
      Get an Iterable that searcher the given text, provided as a character array, for the patterns used to configure this matcher.

      This is a convenience method, added to allow an instance of this class to be used in a 'for' loop:

      
       String patterns[] = {...};
       ACMatcher matcher = new ACMatcher(patterns);
       String text = "...";
       for (ACMatcher.MatchResult mr: matcher.iterableOver(text)) {
          ...
       }
       
      Parameters:
      text - the text to search
      Returns:
      the iterable
    • iterableOver

      public Iterable<ACMatcher.MatchResult> iterableOver(char[] text, int start, int end)
      Get an Iterable that searcher's a portion of the given text, provided as a character array, for the patterns used to configure this matcher.

      This is a convenience method, added to allow an instance of this class to be used in a 'for' loop:

      
       String patterns[] = {...};
       ACMatcher matcher = new ACMatcher(patterns);
       String text = "...";
       int start = ...;
       int end = ...;
       for (ACMatcher.MatchResult mr: matcher.iterableOver(text, start, end)) {
          ...
       }
       
      Parameters:
      text - the text to search
      start - the starting index (inclusive) for the text being scanned
      end - the ending index (exclusive) for the text being scanned
      Returns:
      the iterable
    • stream

      public Stream<ACMatcher.MatchResult> stream(String text)
      Get a stream of matches for the given text, provided as a string.
      Parameters:
      text - the text to scan
      Returns:
      the stream
    • stream

      public Stream<ACMatcher.MatchResult> stream(String text, int start, int end)
      Get a stream of matches for a portion of the given text, provided as a string.
      Parameters:
      text - the text to scan
      start - the starting index (inclusive) for the text being scanned
      end - the ending index (exclusive) for the text being scanned
      Returns:
      the stream
    • stream

      public Stream<ACMatcher.MatchResult> stream(char[] text)
      Get a stream of matches for the given text, provided as a character array.
      Parameters:
      text - the text to scan
      Returns:
      the stream
    • stream

      public Stream<ACMatcher.MatchResult> stream(char[] text, int start, int end)
      Get a stream of matches for a portion the given text, provided as a character array.
      Parameters:
      text - the text to scan
      start - the starting index (inclusive) for the text being scanned
      end - the ending index (exclusive) for the text being scanned
      Returns:
      the stream
    • iterator

      public Iterator<ACMatcher.MatchResult> iterator(String text)
      Get an iterator that will scan a given string and match that string against the patterns used to configure this matcher.
      Parameters:
      text - the string to scan
      Returns:
      an iterator that will enumerate the matches
      See Also:
    • iterator

      public Iterator<ACMatcher.MatchResult> iterator(String text, int start, int end)
      Get an iterator that will scan a portion of a given string and match that string against the patterns used to configure this matcher.
      Parameters:
      text - the string to scan
      start - the starting index (inclusive) for the text being scanned
      end - the ending index (exclusive) for the text being scanned
      Returns:
      an iterator that will enumerate the matches
      See Also:
    • iterator

      public Iterator<ACMatcher.MatchResult> iterator(char[] text)
      Get an iterator that will scan a given character array and match that sequence against the patterns used to configure this matcher.
      Parameters:
      text - the character array to scan
      Returns:
      an iterator that will enumerate the matches
      See Also:
    • iterator

      public Iterator<ACMatcher.MatchResult> iterator(char[] text, int start, int end)
      Get an iterator that will scan a portion of a given character array and match that sequence against the patterns used to configure this matcher.
      Parameters:
      text - the character array to scan
      start - the starting index (inclusive) for the text being scanned
      end - the ending index (exclusive) for the text being scanned
      Returns:
      an iterator that will enumerate the matches
      See Also: