Why does split in Java 8 sometimes remove empty strings at the start of the result array?

2018-06-06 18:50:58

Before Java 8, when we split on an empty string like this

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

|a|b|c|

because an empty string "" exists before and after each character. So it would first generate this array

["", "a", "b", "c", ""]

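and then remove the trailing empty strings (because we did not explicitly pass a negative value for the limit argument), so it would finally return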

["", "a", "b", "c"]
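The effect of the limit argument can be checked directly. The following is a minimal sketch, assuming it runs on Java 8 or later (so the leading empty string of the older behavior does not appear); the class name SplitLimitDemo is made up for illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // limit = 0 is what split(regex) uses: trailing empty strings are removed.
    static String[] defaultLimit() {
        return "abc".split("", 0);
    }

    // A negative limit keeps trailing empty strings.
    static String[] negativeLimit() {
        return "abc".split("", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(defaultLimit()));
        System.out.println(Arrays.toString(negativeLimit()));
    }
}
```

On Java 8 or later the second array has four elements, the last one being the trailing empty string.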

In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"], so it looks as if empty strings at the start are removed as well. But this theory fails because, for instance,

"abc".split("a")

returns an array with an empty string at the start: ["", "bc"].

Can someone explain what is going on here, and how the rules of split for this case changed in Java 8?
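For reference, the two observations can be reproduced with a small program. This is a sketch that assumes Java 8 or later; the class and method names are illustrative only.

```java
import java.util.Arrays;

class SplitQuestionDemo {
    // Splitting on the empty string: no leading "" on Java 8 or later.
    static String[] splitOnEmpty() {
        return "abc".split("");
    }

    // Splitting on "a": the leading "" is still present.
    static String[] splitOnA() {
        return "abc".split("a");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnEmpty()));
        System.out.println(Arrays.toString(splitOnA()));
    }
}
```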

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:

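When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.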

The same clause was also added to String.split in Java 8, compared to Java 7.
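The distinction the clause draws can be illustrated with a word boundary (\b, always zero-width) versus a literal space (positive width). This is only a sketch, assuming Java 8 or later; the inputs and the class name are chosen for illustration.

```java
import java.util.Arrays;

class LeadingMatchWidthDemo {
    // \b matches at offset 0 with zero width, so no leading "" is produced.
    static String[] zeroWidthAtStart() {
        return "abc def".split("\\b");
    }

    // The space at offset 0 is a positive-width match, so a leading "" is produced.
    static String[] positiveWidthAtStart() {
        return " abc".split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zeroWidthAtStart()));
        System.out.println(Arrays.toString(positiveWidthAtStart()));
    }
}
```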

Reference implementation

Let us compare the code of Pattern.split in the reference implementations of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The following code, added in Java 8, excludes a zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
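To see which match that check affects, one can list the spans the Matcher reports. The sketch below is not part of the JDK; describeMatches is a hypothetical helper that flags a match whose start and end are both 0, which is the case the added check skips.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingMatchDemo {
    // Returns one entry per match as "[start,end)", flagging the
    // zero-width match at offset 0 (hypothetical helper for illustration).
    static List<String> describeMatches(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            boolean zeroWidthAtStart = m.start() == 0 && m.end() == 0;
            lines.add("[" + m.start() + "," + m.end() + ")"
                    + (zeroWidthAtStart ? " zero-width at start" : ""));
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("", "abc"));
        System.out.println(describeMatches("a", "abc"));
    }
}
```

For the regex "" every match is zero-width, but only the first one sits at offset 0; for the regex "a" the only match spans [0,1), so it is never skipped.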

Maintaining compatibility

Following the behavior of Java 8 and above

To make split behave consistently across versions and compatibly with the behavior in Java 8:

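1. If your regex can match a zero-length string, add (?!\A) at the end of the regex, wrapping the original regex in a non-capturing group (?:...) if necessary.

2. If your regex cannot match a zero-length string, you do not need to do anything.

3. If you do not know whether the regex can match a zero-length string, apply both actions from step 1.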

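(?!\A) checks that the match does not end at the beginning of the string. A match that ends at the beginning of the string is necessarily an empty match at the beginning, so exactly those matches are rejected.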
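The wrapping step can be sketched as a small helper. The name guardLeadingEmptyMatch is hypothetical, and the sketch assumes the original regex is self-contained (for example, it does not end in an unterminated comment under the COMMENTS flag).

```java
import java.util.Arrays;

class SplitCompat {
    // Wraps the regex in a non-capturing group and appends (?!\A), so a
    // zero-width match at the very start of the input is rejected.
    static String guardLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // Zero-width match at the start: no leading empty string.
        System.out.println(Arrays.toString(
                "abc def".split(guardLeadingEmptyMatch("\\b"))));
        // Positive-width match at the start: leading empty string is kept.
        System.out.println(Arrays.toString(
                " abc".split(guardLeadingEmptyMatch(" "))));
    }
}
```

Because the guarded regex never matches with zero width at offset 0, the Java 8 special case is never triggered, and the Java 7 code path has no leading empty match to turn into an empty string.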

Following the behavior of Java 7 and prior

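There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.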
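One possible shape for such a custom implementation is a static helper adapted from the Java 7 listing quoted earlier. This is a sketch, not a drop-in replacement: the names LegacySplit and splitLikeJava7 are made up, and it mirrors only Pattern.split as shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplit {
    // Splits the input the way the Java 7 Pattern.split listing does,
    // i.e. without skipping a zero-width match at the start.
    static String[] splitLikeJava7(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        // Add the segment before each match found.
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // No match consumed any input: return the input unchanged.
        if (index == 0) {
            return new String[] {input.toString()};
        }

        // Add the remaining segment.
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, drop trailing empty strings.
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }
}
```

Calling LegacySplit.splitLikeJava7(Pattern.compile(""), "abc", 0) yields an array that begins with an empty string, regardless of the Java version it runs on.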

This is specified in the documentation of split(String regex, limit).

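When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.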

In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.

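However, in your second snippet, when you split on "a", you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.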

(Removed irrelevant source code)

The documentation for split() changed slightly from Java 7 to Java 8. Specifically, the following statement was added:

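When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.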

(emphasis mine)

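Splitting on the empty string generates a zero-width match at the beginning, so, in accordance with what is specified above, no empty string is included at the start of the resulting array. By contrast, your second example, which splits on "a", generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.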
