split string with regex using a release character and separators
I need to parse an EDI file, where the separators are the +, : and ' signs and the escape (release) character is ?. You first split into segments:
var data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71+Duzce+Seferihisar / IZMIR++35460+TR";
var segments = data.Split('\'');
Then each segment is split into segment data elements by +, and the segment data elements are split into component data elements via :.
var dataElements = segments[0].Split('+');
The sample string above is not parsed correctly because of the release character. I have special code dealing with this, but I think it should all be doable using
Regex.Split(data, separator);
I am not familiar with regexes and have not found a way to do this so far. The best I have come up with is
string[] lines = Regex.Split(data, @"[^?]\+");
which drops the character before each + sign:
NA
U
ABC2378::9
+XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 7
Duzc
Seferihisar / IZMI
+3546
TR
The correct result should be:
NAD
UC
ABC2378::92
XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71
Duzce
Seferihisar / IZMIR
35460
TR
So the question is: is this doable with Regex.Split, and what should the regex separator look like?
I can see that you want to split around plus signs + only if they are not preceded (escaped) by a question mark ?. This can be done using the following:
(?<!\?)\+
This matches a + sign only if it is not immediately preceded by a question mark ?.
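To make the behaviour concrete, here is a minimal sketch of that lookbehind pattern applied to a shortened version of the sample segment. The class and method names (SimpleSplitDemo, SplitElements) are made up for illustration and are not part of any library; only Regex.Split is the real .NET call.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleSplitDemo
{
    // Split on '+' unless the character immediately before it is '?'.
    // (Hypothetical helper name, used only for this illustration.)
    public static string[] SplitElements(string segment)
    {
        return Regex.Split(segment, @"(?<!\?)\+");
    }

    public static void Main()
    {
        // Shortened version of the sample segment from the question.
        string segment = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555+Duzce";

        // Brackets make the empty element between "++" visible.
        foreach (string element in SplitElements(segment))
        {
            Console.WriteLine("[" + element + "]");
        }
    }
}
```

Note that the escaped ?+ stays inside the "XYZ Corp." element, and that the two consecutive + signs produce an empty string in the result, because each + is treated as its own separator.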
Edit: The problem with the previous expression is that it doesn't handle situations like ??+, ???+ or ????+; in other words, it doesn't handle situations where ? characters are used to escape themselves.
We can solve this by noticing that if an odd number of ? characters precedes a +, then the last one is definitely escaping the +, so we must not split; but if an even number of ? characters precedes a +, they cancel each other out in pairs and leave the + unescaped, so we should split around it.
From the previous observation, we need an expression that matches a + only if it is preceded by an even number (including zero) of question marks ?, and here it is:
(?<!(^|[^?])(\?\?)*\?)\+
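As a rough check of the odd/even reasoning, the sketch below runs that expression over a few small inputs. The names EdiElementSplitter and SplitElements are hypothetical. As a precaution I have written the groups as non-capturing, (?:...), since Regex.Split adds the text of any successful capture group to its output; the logic is otherwise the same as the expression above. It relies on .NET allowing variable-length lookbehind, which many other regex engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdiElementSplitter
{
    // '+' counts as a separator unless it is preceded by an ODD run of '?'.
    // Lookbehind body: (start or a non-'?') + any number of "??" pairs + one '?'.
    private static readonly Regex ElementSeparator =
        new Regex(@"(?<!(?:^|[^?])(?:\?\?)*\?)\+");

    // Hypothetical helper name, used only for this illustration.
    public static string[] SplitElements(string segment)
    {
        return ElementSeparator.Split(segment);
    }

    public static void Main()
    {
        string[] samples = { "A?+B+C", "A??+B", "A???+B" };
        foreach (string sample in samples)
        {
            // "A?+B+C"  : one '?'   (odd)  -> first '+' is escaped
            // "A??+B"   : two '?'   (even) -> '+' is a separator
            // "A???+B"  : three '?' (odd)  -> '+' is escaped
            Console.WriteLine(sample + " -> " + string.Join(" | ", SplitElements(sample)));
        }
    }
}
```

The release characters are left in the output here; removing them (turning ?+ into + and ?? into ?) would be a separate step after splitting.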
string[] lines = Regex.Split(data, @"\+");
Would this meet the requirement?
Here is the edit that accounts for a '?' escaping the '+':
string[] lines = Regex.Split(data, @"(?<!\?)[+]+");
The '+' at the end matches multiple consecutive occurrences of the separator '+' as a single separator. If you want empty entries between consecutive separators instead, use:
string[] lines = Regex.Split(data, @"(?<!\?)[+]");
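The difference between the two variants is easiest to see side by side. This is only a sketch; EmptyElementDemo and its two method names are invented for the comparison. Which behaviour is right depends on the data: in EDI the position of a data element usually carries meaning, so keeping the empty entries is probably the safer default, but that is an assumption about your use case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyElementDemo
{
    // Treats a run of unescaped '+' as ONE separator (no empty entries).
    public static string[] SplitCollapsing(string segment)
    {
        return Regex.Split(segment, @"(?<!\?)[+]+");
    }

    // Treats every unescaped '+' as its own separator (keeps empty entries).
    public static string[] SplitKeepingEmpty(string segment)
    {
        return Regex.Split(segment, @"(?<!\?)[+]");
    }

    public static void Main()
    {
        string tail = "IZMIR++35460+TR";

        // Collapsing: IZMIR, 35460, TR
        Console.WriteLine(string.Join(" | ", SplitCollapsing(tail)));

        // Keeping empties: IZMIR, (empty), 35460, TR
        Console.WriteLine(string.Join(" | ", SplitKeepingEmpty(tail)));
    }
}
```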