Why does split in Java 8 sometimes remove empty strings at the start of the result array?
Before Java 8, when we split on an empty string like
String[] tokens = "abc".split("");
the split mechanism would split at the places marked with |
|a|b|c|
because an empty string "" exists before and after each character. As a result, it would first generate this array
["", "a", "b", "c", ""]
and then remove the trailing empty strings (because we didn't explicitly pass a negative value for the limit argument), so it would finally return
["", "a", "b", "c"]
In Java 8 the split mechanism seems to have changed. Now when we use
"abc".split("")
we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"], so it looks like empty strings at the start are also removed. But this theory fails because, for instance,
"abc".split("a")
returns an array with an empty string at the start: ["", "bc"].
Can someone explain what is going on here and how the rules of split for these cases have changed in Java 8?
The behavior of String.split (which calls Pattern.split) changes between Java 7 and Java 8.
Documentation
Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:
When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
The same clause was also added to String.split in Java 8, compared to Java 7.
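As a quick check of the clause quoted above, here is a minimal sketch that assumes a Java 8 or later runtime. The class and method names are made up for illustration only.

```java
import java.util.Arrays;

final class LeadingEmptyDemo {
    // Zero-width match at index 0: on Java 8+ no leading "" is produced.
    static String[] splitOnEmpty() {
        return "abc".split("");
    }

    // Positive-width match at index 0: the leading "" is kept.
    static String[] splitOnA() {
        return "abc".split("a");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnEmpty())); // [a, b, c] on Java 8+
        System.out.println(Arrays.toString(splitOnA()));     // [, bc]
    }
}
```

On Java 7 the first call would instead be expected to return ["", "a", "b", "c"], as described in the question.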
Reference implementation
Let us compare the code of Pattern.split in the reference implementation in Java 7 and Java 8. The code is retrieved from grepcode, for versions 7u40-b43 and 8-b132.
Java 7
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
Java 8
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.
    if (index == 0 && index == m.start() && m.start() == m.end()) {
        // no empty leading substring included for zero-width match
        // at the beginning of the input char sequence.
        continue;
    }
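To see which matches that condition can affect, the following sketch lists the start and end of every match the regex engine reports. The helper name spans is an assumption made for this illustration; only standard Pattern and Matcher calls are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchWidthTrace {
    // Returns each match as "start-end"; start == end means a zero-width match.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("", "abc"));  // [0-0, 1-1, 2-2, 3-3]
        System.out.println(spans("a", "abc")); // [0-1]
    }
}
```

Only the 0-0 match satisfies index == 0, index == m.start() and m.start() == m.end(), so it is the only one skipped. The 0-1 match for "a" has positive width and still produces the leading empty string.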
Maintaining compatibility
Following behavior in Java 8 and above
To make split behave consistently across versions and compatible with the behavior in Java 8, add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary). (?!\A) checks that the match does not end at the beginning of the string; a match ending there can only be an empty match at the beginning of the string.
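The rewrite described above could be sketched as follows, assuming the intended lookahead is (?!\A) and that wrapping unconditionally in a non-capturing group is acceptable. The helper name noLeadingEmptyMatch is hypothetical.

```java
import java.util.Arrays;

final class SplitCompat {
    // Hypothetical helper: wraps the regex in (?:...) and appends (?!\A),
    // which rejects any match that ends at the start of the input.
    static String noLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("abc".split(noLeadingEmptyMatch(""))));  // [a, b, c]
        System.out.println(Arrays.toString("abc".split(noLeadingEmptyMatch("a")))); // [, bc]
    }
}
```

Because the zero-width match at index 0 is never reported by the regex engine, the split code has no leading match to treat differently, whichever version runs it.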
Following behavior in Java 7 and prior
There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
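Such a custom implementation might look like the sketch below, which is modeled on the Java 7 listing quoted earlier. The class name LegacySplit and its method signature are assumptions for illustration, not part of any JDK API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacySplit {
    // Hypothetical helper following the Java 7 logic: a zero-width match at
    // index 0 still yields a leading empty string.
    static String[] split(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            if (!matchLimited || parts.size() < limit - 1) {
                parts.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (parts.size() == limit - 1) { // last allowed segment
                parts.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }
        if (index == 0) { // treated as "no match found", as in the Java 7 listing
            return new String[] {input.toString()};
        }
        if (!matchLimited || parts.size() < limit) {
            parts.add(input.subSequence(index, input.length()).toString());
        }
        int size = parts.size();
        if (limit == 0) { // drop trailing empty strings
            while (size > 0 && parts.get(size - 1).isEmpty()) {
                size--;
            }
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(Pattern.compile(""), "abc", 0)));  // [, a, b, c]
        System.out.println(Arrays.toString(split(Pattern.compile("a"), "abc", 0))); // [, bc]
    }
}
```

Call sites would then use LegacySplit.split(Pattern.compile(regex), input, 0) in place of input.split(regex).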
This has been specified in the documentation of split(String regex, limit).
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.
However, in your second snippet, when you split on "a" you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.
There was a slight change in the docs for split() from Java 7 to Java 8. Specifically, the following statement was added:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
The empty string split generates a zero-width match at the beginning, so an empty string is not included at the start of the resulting array, in accordance with what is specified above. By contrast, your second example, which splits on "a", generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.