Finding if a string contains any string in a collection

2018-06-13 23:28:13

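I'm trying to improve the performance of a Java function that determines whether a given search string contains at least one of the strings in a collection. This may look like premature optimization, but the function is called a lot, so any speedup would be very beneficial.

The code currently looks like this: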

public static boolean containsAny(String searchString, List<String> searchCollection) {
    int size = searchCollection.size();
    for (int i = 0; i < size; i++) {
        String stringInCollection = searchCollection.get(i);
        if (!Util.isNullOrEmpty(stringInCollection)) {
            // This is a performance optimization of contains.
            if (searchString.indexOf(stringInCollection, 0) > -1) {
                return true;
            }
        }
    }
    return false;
}

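The list typically has around 30 elements, and the same collection is reused between calls.

The code above is a straightforward linear search. I don't think it can be improved significantly unless we change the data structure to do better than O(n). Is there a data structure that would let me do that?

You can speed this up significantly by using the Aho-Corasick algorithm.

You can build an Aho-Corasick automaton for the collection in O(total length of all strings in the collection) time and space. You can then check whether any string in the collection is a substring of a given string S in O(S.length) time by traversing that automaton.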
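As a rough illustration of the approach described above, here is a minimal sketch of such an automaton. The class name `AhoCorasickMatcher` and its layout are assumptions made for this example, not taken from any library. It skips null and empty entries the same way the loop in the question does, and it stores transitions in hash maps, so the linear bounds hold in the expected (amortized) sense.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical helper: answers "does the text contain any of the patterns?". */
final class AhoCorasickMatcher {
    /** One automaton state. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;          // longest proper suffix of this state that is also a trie prefix
        boolean terminal;   // true if some pattern ends here, directly or via failure links
    }

    private final Node root = new Node();

    AhoCorasickMatcher(Collection<String> patterns) {
        // Phase 1: insert every non-empty pattern into a trie.
        for (String p : patterns) {
            if (p == null || p.isEmpty()) {
                continue;
            }
            Node node = root;
            for (int i = 0; i < p.length(); i++) {
                node = node.next.computeIfAbsent(p.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }

        // Phase 2: breadth-first pass that sets the failure links.
        root.fail = root;
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // A pattern that is a suffix of this state's string also ends here.
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    /** Scans the text once, following failure links on mismatches. */
    boolean containsAny(String text) {
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            if (node.terminal) {
                return true;
            }
        }
        return false;
    }
}
```

Since the same collection is reused between calls, the matcher would be built once and kept around, and each call would then reduce to `matcher.containsAny(searchString)`. Whether this beats the plain loop for a list of about 30 short strings is something to measure rather than assume.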

// Make a regex pattern (once only):
StringBuilder pattern = new StringBuilder();
for (String sought : searchCollection) {
    if (!Util.isNullOrEmpty(sought)) {
        if (pattern.length() != 0) {
            pattern.append('|');
        }
        pattern.append(Pattern.quote(sought));
    }
}
final Pattern PATTERN = Pattern.compile("(" + pattern + ")");

This creates an alternation such as "(abc|def|ghi)". You might also consider a case-insensitive search.

And in the function containsAny:

Matcher m = PATTERN.matcher(searchString);
return m.find();

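Regex compilation is relatively smart. It would be comparable to using a search tree over your collection of sought words: "agent" and "agitator" to ("ag", ("ent", "itator"))

This is a CPU-intensive operation; it is not long-running and does not block on I/O. If you are using Java 8, you can use parallel streams to process the collection in parallel, as shown below. The method has been changed to take a Collection instead of a List to make it more flexible.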
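The two regex snippets above can be folded into one small class, sketched here with the case-insensitive option included. The class name `RegexContainsAny` and the `ignoreCase` parameter are assumptions made for this example. It uses only `Pattern.quote`, `Pattern.compile` with flags, and `Matcher.find` from `java.util.regex`.

```java
import java.util.Collection;
import java.util.regex.Pattern;

/** Hypothetical wrapper: compiles the alternation once and reuses it for every call. */
final class RegexContainsAny {
    // Stays null when the collection holds no usable entries, because an
    // empty pattern would match every input string.
    private final Pattern pattern;

    RegexContainsAny(Collection<String> searchCollection, boolean ignoreCase) {
        StringBuilder alternation = new StringBuilder();
        for (String sought : searchCollection) {
            if (sought == null || sought.isEmpty()) {
                continue;
            }
            if (alternation.length() != 0) {
                alternation.append('|');
            }
            // Quote each entry so regex metacharacters are matched literally.
            alternation.append(Pattern.quote(sought));
        }
        int flags = ignoreCase ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
        this.pattern = (alternation.length() == 0)
                ? null
                : Pattern.compile(alternation.toString(), flags);
    }

    boolean containsAny(String searchString) {
        return pattern != null && pattern.matcher(searchString).find();
    }
}
```

The null check is a design choice for this sketch: it keeps the behavior of the original loop, which returns false when the collection has nothing to search for.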

public static boolean containsAny(final String searchString,
        final Collection<String> searchCollection) {
    return searchCollection.stream().parallel()
            // Skip null and empty entries, as the original loop does.
            .filter(x -> !Util.isNullOrEmpty(x))
            .anyMatch(x -> searchString.indexOf(x) > -1);
}

Also, instead of a List, a Set should be used as the underlying data structure, so that duplicate entries (if any) are eliminated.
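A minimal sketch of that suggestion, assuming the collection can be prepared once up front since it is reused between calls. The class name `SearchTerms` and the method `normalize` are made up for this example. A `LinkedHashSet` drops duplicates while keeping the original iteration order.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: prepares the search terms once, then reuses them. */
final class SearchTerms {
    /** Drops null, empty, and duplicate entries, keeping first-seen order. */
    static Set<String> normalize(Collection<String> raw) {
        Set<String> terms = new LinkedHashSet<>();
        for (String s : raw) {
            if (s != null && !s.isEmpty()) {
                terms.add(s);
            }
        }
        return terms;
    }

    /** Same linear search as before, but over the cleaned-up set. */
    static boolean containsAny(String searchString, Set<String> terms) {
        for (String term : terms) {
            if (searchString.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

This also moves the null and empty checks out of the hot path, since they only need to run when the set is built.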
