How to make an XmlDocument respect HTML escape sequences

2018-06-21 14:56:41

DISCLAIMER: Yes, I know the solution is not optimal, but it is what it is.

We are creating a large XML file and then serving it via a WCF service. The consumer is a vendor that has a mobile gateway. They take the large file and chop it up for mobile calls.

The actual creation bits use the Microsoft XML objects (XmlDocument, XmlElement, XmlTextNode, etc.), and the result is then saved to the file system. The service pulls the file, reconstitutes it as an XML document, and serves it.

[OperationContract]
[Description("Gets all products for SnP and Systems.")]
[WebGet(UriTemplate = "shop/products/all?appId={appId}")]
XmlElement GetAllProductsAsXmlDocument(string appId);

When I produce a file, I end up with a file that looks something like this:

<content>&lt;b&gt;Intel® Core™ Duo &amp; 2 GB RAM&lt;/b&gt;</content>

Which, in a browser as HTML, would look like this <b>Intel® Core™ Duo & 2 GB RAM</b> .

The vendor has asked for the text in the XML document to look like this:

<content>&lt;b&gt;Intel&reg; Core&trade; Duo &amp; 2 GB RAM&lt;/b&gt;</content>

If this were a string, rather than text in an XML node, I could easily do this:

string hackedString = HttpUtility.HtmlEncode(nonHackedTextFromXmlNode);

But encoding and then slapping into the XmlDocument as a TextNode yields:

<content>&lt;b&gt;Intel® Core™ Duo &amp; 2 GB RAM&lt;/b&gt;</content>
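The re-escaping can be reproduced in isolation. Below is a minimal sketch, assuming a console project; the class and method names are illustrative and not from the original code. When text that already contains an entity reference is placed in a text node, XmlDocument escapes the ampersand again on output:

```csharp
using System;
using System.Xml;

public static class DoubleEscapeDemo
{
    // Hypothetical helper: wraps the given text in a <content> element
    // and returns the serialized XML.
    public static string Serialize(string text)
    {
        var doc = new XmlDocument();
        XmlElement content = doc.CreateElement("content");
        // A text node treats its value as literal characters, so any '&'
        // in the input is written out as "&amp;".
        content.AppendChild(doc.CreateTextNode(text));
        doc.AppendChild(content);
        return doc.OuterXml;
    }

    public static void Main()
    {
        // Prints: <content>Intel&amp;reg; Core&amp;trade;</content>
        Console.WriteLine(Serialize("Intel&reg; Core&trade;"));
    }
}
```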

So the Microsoft XML classes recognize certain escaped HTML sequences and turn them into the version found in their specification of XML. Dinking around with manual encoding, I can also end up with &amp; and &#174; (a messed up ®, as 174 decimal == ®), but if the symbol is recognized going into the XML document, it looks like the above when the escaped versions are loaded.

The question is this: Is there some unique type of encoding or setting or "other" that can be used with a .NET XmlDocument to produce nodes that automagically respect HTML encoding rules?

If it can't be done, that is fine. I have already suggested three possibilities:

Create CDATA nodes instead of standard text nodes, so the encoding is not altered

Transform the characters after the file is saved and serve it as a string, not an XmlDocument, in the WCF service.

Have the vendor translate the data to HTML escaped strings
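The first option might look roughly like the sketch below. It assumes the markup has already been HTML-encoded by hand or by some other step; the class and method names are illustrative. A CDATA section is written out verbatim, so the entity references survive serialization:

```csharp
using System;
using System.Xml;

public static class CDataDemo
{
    // Hypothetical helper: stores the given markup in a CDATA section
    // inside a <content> element and returns the serialized XML.
    public static string Serialize(string markup)
    {
        var doc = new XmlDocument();
        XmlElement content = doc.CreateElement("content");
        // CDATA content is not escaped on output.
        content.AppendChild(doc.CreateCDataSection(markup));
        doc.AppendChild(content);
        return doc.OuterXml;
    }

    public static void Main()
    {
        // Prints: <content><![CDATA[<b>Intel&reg; Core&trade; Duo &amp; 2 GB RAM</b>]]></content>
        Console.WriteLine(Serialize("<b>Intel&reg; Core&trade; Duo &amp; 2 GB RAM</b>"));
    }
}
```

Note that the raw file then contains a CDATA wrapper and unescaped tags rather than the exact form the vendor asked for, and text containing `]]>` would need extra handling, so whether this works depends on how the vendor reads the file.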

Thoughts?

ADDITIONAL INFO: Per suggestion, added the HTML DTD:

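// System identifier (URL) and public identifier of the XHTML 1.0 Transitional DTD
string dtdLink = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";
string dtdDef = "-//W3C//DTD XHTML 1.0 Transitional//EN";
// CreateDocumentType(name, publicId, systemId, internalSubset)
XmlDocumentType docType = htmlDoc.CreateDocumentType("html", dtdDef, dtdLink, null);
htmlDoc.AppendChild(docType);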

Still adds extra & to the output. May try the other HTML DTDs, but I am soon to run out of time. Thanks.

The issue, overall, is Microsoft corrects a lot of things. CDATA avoids some of the correction. Note that this correction is proper and there would be no issue if the vendor had their parser set up for UTF-8. Sometimes you just have to say "it is what it is".

The solution employed was to add a scrubbing "filter" to the end of the processing pipeline. It is a nasty solution, since it did not properly solve the problem, and the client (internal) now wants the filter on all services.
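The filter itself is not shown here, so the following is only a guess at its general shape: a pass over the already serialized XML string that swaps selected characters for HTML entity names. The class name, the method name, and the entity table are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EntityScrubber
{
    // Hypothetical mapping; a real filter would list whatever the vendor needs.
    private static readonly Dictionary<char, string> Named = new Dictionary<char, string>
    {
        { '\u00AE', "&reg;" },   // ®
        { '\u2122', "&trade;" }, // ™
        { '\u00A9', "&copy;" },  // ©
    };

    // Replaces mapped characters with named entities and any other
    // non-ASCII character with a decimal character reference.
    public static string Scrub(string xml)
    {
        var sb = new StringBuilder(xml.Length);
        for (int i = 0; i < xml.Length; i++)
        {
            char c = xml[i];
            if (Named.TryGetValue(c, out string entity))
            {
                sb.Append(entity);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < xml.Length && char.IsLowSurrogate(xml[i + 1]))
            {
                // Combine a surrogate pair into a single code point.
                sb.Append("&#").Append(char.ConvertToUtf32(c, xml[i + 1])).Append(';');
                i++;
            }
            else if (c > 127)
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Scrub("<content>&lt;b&gt;Intel® Core™ Duo &amp; 2 GB RAM&lt;/b&gt;</content>"));
    }
}
```

Because names such as &reg; are not predefined in XML, the scrubbed output would have to be served as a plain string; loading it back into an XmlDocument without a DTD that declares those entities would fail.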

The proper solution would have been to have the vendor respect UTF-8 so we did not have to scrub perfectly valid characters. Unfortunately, as with many projects, time was more important than quality.
