Splitting a comma-separated string but ignoring commas in quotes

I have a string vaguely like this:

foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

that I want to split by commas, but I need to ignore commas in quotes. How can I do this? A regexp approach seems to fail; I suppose I could scan manually and switch to a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I meant libraries that are already part of the JDK or of a commonly used library like Apache Commons.)

the above string should split into:

foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"

note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure


Try:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
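
To make that rule concrete, here is the same condition written out without a regex. It is only a sketch to show what the look-ahead tests: the class and method names are made up, and it recounts the quotes for every comma, so it is slower than it needs to be.

```java
import java.util.ArrayList;
import java.util.List;

class EvenQuotesAhead {
    // Counts the double quotes in s from index 'from' to the end of the string.
    static int quotesAhead(String s, int from) {
        int count = 0;
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '"') count++;
        }
        return count;
    }

    // Splits at every comma that is followed by an even number of quotes (zero included).
    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAhead(s, i + 1) % 2 == 0) {
                tokens.add(s.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(s.substring(start)); // whatever follows the last split point
        return tokens;
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(line)) {
            System.out.println("> " + t);
        }
    }
}
```

A comma inside a quoted section always has an odd number of quotes after it (at least the closing one), which is why it is skipped.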

Or, a bit friendlier for the eyes:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (?:                     "+ //   start non-capturing group 1
                "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                "    %s                    "+ //     match 'quotedString'
                "  )*                      "+ //   end group 1 and repeat it zero or more times
                "  %s*                     "+ //   match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

which produces the same as the first example.
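
If the split runs over many strings, the expression can be compiled once and reused instead of being recompiled on every String#split() call. A minimal sketch, assuming the same regex as in the first example (the class and constant names are made up):

```java
import java.util.regex.Pattern;

class CompiledSplit {
    // The expression from the first example, compiled a single time.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] split(String line) {
        // A negative limit keeps trailing empty tokens, as in line.split(regex, -1).
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(line)) {
            System.out.println("> " + t);
        }
    }
}
```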

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see the discussion above about empty matches being trimmed by String#split()), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
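
For reference, the String#split() behaviour that comment refers to can be seen with the JDK alone: without a limit argument trailing empty strings are dropped, while a negative limit keeps them. A small sketch (the input string and the names are made up):

```java
import java.util.Arrays;

class SplitLimitDemo {
    // No limit argument: trailing empty strings are removed from the result.
    static String[] defaultSplit(String s) {
        return s.split(",");
    }

    // Negative limit: the pattern is applied as often as possible and nothing is discarded.
    static String[] keepEmpties(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "a,b,,"; // made-up input ending in two empty fields
        System.out.println(Arrays.toString(defaultSplit(line))); // 2 tokens
        System.out.println(Arrays.toString(keepEmpties(line)));  // 4 tokens, the last two empty
    }
}
```

This is why the examples above pass -1 as the second argument.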

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regard to maintainability, e.g.:

import java.util.ArrayList;
import java.util.List;

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
    if (input.charAt(current) == '"') inQuotes = !inQuotes; // toggle state
    else if (input.charAt(current) == ',' && !inQuotes) {
        result.add(input.substring(start, current)); // token ends just before this comma
        start = current + 1;
    }
}
result.add(input.substring(start)); // last token; empty if the input ends with a comma

If you don't care about preserving the commas inside the quotes, you could simplify this approach (no start index to track, no special handling of the last token) by replacing the commas inside quotes with something else and then splitting at commas:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
    char currentChar = builder.charAt(currentIndex);
    if (currentChar == '"') inQuotes = !inQuotes; // toggle state
    if (currentChar == ',' && inQuotes) {
        builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
    }
}
List<String> result = Arrays.asList(builder.toString().split(","));
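
The "replace later" idea from the comment above could look like the following sketch: commas inside quotes are swapped for a placeholder, the string is split, and the placeholder is turned back into a comma in each token. The class name and the choice of '\u0001' as placeholder are assumptions; any character that cannot occur in the input would do.

```java
import java.util.ArrayList;
import java.util.List;

class PlaceholderSplit {
    // Assumption: this control character never occurs in the input.
    private static final char PLACEHOLDER = '\u0001';

    static List<String> split(String input) {
        StringBuilder builder = new StringBuilder(input);
        boolean inQuotes = false;
        for (int i = 0; i < builder.length(); i++) {
            char c = builder.charAt(i);
            if (c == '"') inQuotes = !inQuotes; // toggle state
            if (c == ',' && inQuotes) {
                builder.setCharAt(i, PLACEHOLDER); // hide the quoted comma from split()
            }
        }
        List<String> result = new ArrayList<>();
        // Limit -1 keeps trailing empty tokens.
        for (String token : builder.toString().split(",", -1)) {
            result.add(token.replace(PLACEHOLDER, ',')); // restore the quoted commas
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(input)) {
            System.out.println("> " + t);
        }
    }
}
```

With the commas restored, the tokens are the same as the ones the scanning loop produces, at the cost of a second pass over each token.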

http://sourceforge.net/projects/javacsv/

https://github.com/pupi1985/JavaCSV-Reloaded (fork of the previous library that allows the generated output to use Windows line terminators, \r\n, when not running on Windows)

http://opencsv.sourceforge.net/

CSV API for Java

Can you recommend a Java library for reading (and possibly writing) CSV files?

Java lib or app to convert CSV to XML file?
