Removing empty elements from xml with regex that matches a sequence twice

2018-06-23 13:03:51

I'm looking to remove empty elements from an XML file because the reader expects a value. This is not the xsi:nil="true" case, nor the element without content <Element /> covered in "Deserialize Xml with empty elements in C#". Here the element has both tags but the inner part is simply missing: <Element></Element>

I've tried writing my own code for removing these elements, but my code is too slow and the files are too large. The end of every item also matches a naive pattern, because one closing tag is immediately followed by another. So the following regex would remove valid XML:
@"<.*></*>"

I need some sort of regex that makes sure the text matched by the two * parts is the same.

So:

<Item><One>1</One><Two></Two><Three>3</Three></Item>

Would change into:

<Item><One>1</One><Three>3</Three></Item>

The fact that it's all on one line makes this harder, because the end of the item comes right after the end of Three, which produces the very pattern I'd like to look for.

I don't have access to the original data that would allow recreating valid xml.

You want to capture one or more word characters inside < ... >
and match the closing tag by using the \1 backreference to what was captured by the first group.

<(\w+)></\1>

See demo at regex101
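
As a minimal sketch of how that pattern might be applied from C#, assuming the whole document fits in a string (the class and method names here are hypothetical, chosen only for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyElementDemo
{
    // Removes <Name></Name> pairs whose opening and closing names are identical.
    public static string RemoveEmptyPairs(string xml)
    {
        // (\w+) captures the tag name; \1 requires the closing tag to repeat it,
        // so a sequence such as </Three></Item> is left alone.
        return Regex.Replace(xml, @"<(\w+)></\1>", string.Empty);
    }

    public static void Main()
    {
        string input = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
        Console.WriteLine(RemoveEmptyPairs(input));
        // prints <Item><One>1</One><Three>3</Three></Item>
    }
}
```

Note that \w does not cover every character allowed in an XML name (for example - or :), so the character class may need widening for real data.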

AFAIK there is no need to capture a group, because <a></b> (which a simple regex without capturing would also match) is simply invalid XML and can't be in your file (unless you're parsing HTML, in which case, even though it can be done, I'd suggest not using regex). Capturing a group is required only if you're matching non-empty nodes, which is not your case.

Note that your regex has a problem (besides the unescaped /): . matches any character, but XML tag names cannot contain arbitrary characters. If you absolutely want to use .* then it should be the lazy .*? and you should exclude /.

What I would do is to keep regex as simple as possible (still matching valid XML node names or - even better - only what you know is your data input):

<\w+></\w+>

You could use a stricter check for the tag name; for example \s*[\w\d]+\s* may be slightly better. A regex with fewer steps will perform better on very large files. You may also want to allow an optional new-line between the opening and closing tags.

Note that you may need to loop until no more replacements are made if, for example, you have <outer><inner></inner></outer> and want it reduced to an empty string (especially in this case, don't forget to compile your regex).
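
The loop could be sketched roughly as follows. This is an illustration only: the class and method names are hypothetical, the backreference form is used here as a safety check even though it is arguably unnecessary for well-formed XML, and \s* is an assumed way to allow the optional new-line between the tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedEmptyElementDemo
{
    // Compiled once and reused across passes; \s* tolerates whitespace
    // (including a new-line) between the opening and closing tag.
    private static readonly Regex EmptyPair =
        new Regex(@"<(\w+)>\s*</\1>", RegexOptions.Compiled);

    public static string RemoveAllEmpty(string xml)
    {
        string previous;
        do
        {
            previous = xml;
            // Each pass can expose new empty parents, so repeat until stable.
            xml = EmptyPair.Replace(xml, string.Empty);
        } while (xml != previous);
        return xml;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveAllEmpty("<Item><A>1</A><outer><inner></inner></outer></Item>"));
        // prints <Item><A>1</A></Item>
    }
}
```

Each pass rescans the whole string, so for very large files the number of passes (one per nesting level of empty elements, plus a final pass that finds nothing) matters more than the cost of compiling the regex.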

Use LINQ to XML:

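using System.Linq;
using System.Xml.Linq;
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
// Rebuild Item from only those descendants whose value is non-empty (assumes a flat item).
item = new XElement("Item", item.Descendants().Where(x => x.Value.Length != 0));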
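
A possible variant, sketched under the assumption that the input may contain nested elements, removes the empty elements in place instead of rebuilding the root. The class and method names are hypothetical. Note that this filter also removes self-closing elements such as <Two />, and that XElement.Parse loads the whole document into memory, which may not suit very large files.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class LinqEmptyElementDemo
{
    public static string RemoveEmpty(string xml)
    {
        XElement root = XElement.Parse(xml);
        while (true)
        {
            // Snapshot the matches first so the tree is not modified while it is being enumerated.
            var empty = root.Descendants()
                .Where(e => !e.HasElements && !e.HasAttributes && e.Value.Length == 0)
                .ToList();
            if (empty.Count == 0) break;
            foreach (XElement e in empty) e.Remove();
            // Loop again: a parent may have become empty after its children were removed.
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveEmpty("<Item><One>1</One><Two></Two><Three>3</Three></Item>"));
        // prints <Item><One>1</One><Three>3</Three></Item>
    }
}
```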
