HttpUtility.HtmlEncode escaping too much?

In our ASP.NET MVC 3 project, the HttpUtility.HtmlEncode method seems to be escaping too many characters. Our web pages are served as UTF-8, but the method still escapes characters like ü or the yen sign ¥, even though these characters can be represented directly in UTF-8.

So when my ASP.NET MVC view contains the following piece of code:

    @("<strong>ümlaut</strong>")

Then I would expect the encoder to escape the HTML tags, but not the ü in ümlaut:

    &lt;strong&gt;ümlaut&lt;/strong&gt;

But instead it is giving me the following piece of HTML:

    &lt;strong&gt;&#252;mlaut&lt;/strong&gt;

For completeness, I should also mention that the responseEncoding in the web.config is explicitly set to utf-8, so I would expect the HtmlEncode method to respect this setting.

    <globalization requestEncoding="utf-8" responseEncoding="utf-8" />

Yes, I have faced the same issue with my web pages. If we look at the code of HtmlEncode, there is a point where it translates exactly this set of characters. Here is the part that causes these characters to be translated as well:

if ((ch >= '\x00a0') && (ch < '\x0100'))
{
    output.Write("&#");
    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
    output.Write(';');
}
else
{
    output.Write(ch);
}

Here is the full code of HtmlEncode:

public static unsafe void HtmlEncode(string value, TextWriter output)
{
    if (value != null)
    {
        if (output == null)
        {
            throw new ArgumentNullException("output");
        }
        int num = IndexOfHtmlEncodingChars(value, 0);
        if (num == -1)
        {
            output.Write(value);
        }
        else
        {
            int num2 = value.Length - num;
            fixed (char* str = ((char*) value))
            {
                char* chPtr = str;
                char* chPtr2 = chPtr;
                while (num-- > 0)
                {
                    output.Write(chPtr2[0]);
                    chPtr2++;
                }
                while (num2-- > 0)
                {
                    char ch = chPtr2[0];
                    if (ch <= '>')
                    {
                        switch (ch)
                        {
                            case '&':
                            {
                                output.Write("&amp;");
                                chPtr2++;
                                continue;
                            }
                            case '\'':
                            {
                                output.Write("&#39;");
                                chPtr2++;
                                continue;
                            }
                            case '"':
                            {
                                output.Write("&quot;");
                                chPtr2++;
                                continue;
                            }
                            case '<':
                            {
                                output.Write("&lt;");
                                chPtr2++;
                                continue;
                            }
                            case '>':
                            {
                                output.Write("&gt;");
                                chPtr2++;
                                continue;
                            }
                        }
                        output.Write(ch);
                        chPtr2++;
                        continue;
                    }
                    // This is the point: characters from U+00A0 up to (but not including) U+0100 become numeric entities
                    if ((ch >= '\x00a0') && (ch < '\x0100'))
                    {
                        output.Write("&#");
                        output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
                        output.Write(';');
                    }
                    else
                    {
                        output.Write(ch);
                    }
                    chPtr2++;
                }
            }
        }
    }
}
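
To see the effect of that range check in isolation, here is a minimal sketch. The class and method names (Latin1RangeDemo, EncodeLatin1Range) are hypothetical and not part of the framework, and the sketch deliberately leaves out the handling of <, >, &, " and '. It only assumes that the check behaves as in the listing above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Latin1RangeDemo
{
    // Hypothetical helper: reproduces only the U+00A0..U+00FF range check
    // from the listing above, nothing else.
    public static string EncodeLatin1Range(string value)
    {
        var builder = new StringBuilder(value.Length);
        foreach (char ch in value)
        {
            if (ch >= '\u00A0' && ch < '\u0100')
            {
                // Write the character as a decimal numeric entity, e.g. ü (U+00FC) becomes &#252;
                builder.Append("&#");
                builder.Append(((int)ch).ToString(NumberFormatInfo.InvariantInfo));
                builder.Append(';');
            }
            else
            {
                builder.Append(ch);
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // ü (252) and ¥ (165) fall inside the range, Ā (256) does not.
        string encoded = EncodeLatin1Range("ümlaut ¥ Ā");
        Console.WriteLine(encoded == "&#252;mlaut &#165; Ā"); // True
    }
}
```

So the escaping depends only on the code point of the character, which would explain why the responseEncoding setting makes no difference here.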

A possible solution is to write your own custom HtmlEncode, or to use the Anti-Cross Site Scripting (AntiXSS) library from Microsoft.

http://msdn.microsoft.com/en-us/security/aa973814


As Aristos suggested we could use the AntiXSS library from Microsoft. It contains a UnicodeCharacterEncoder that behaves as you would expect.

But because we

  • didn't really want to depend on a 3rd party library just for HTML encoding, and
  • were quite sure that our content didn't exceed the UTF-8 range,

we chose to implement our own very basic HTML encoder. You can find the code below. Please feel free to adapt/comment/improve if you see any issues.

    public static class HtmlEncoder
    {
        private static IDictionary<char, string> toEscape = new Dictionary<char, string>()
                                                                {
                                                                    { '<', "lt" },
                                                                    { '>', "gt" },
                                                                    { '"', "quot" },
                                                                    { '&', "amp" },
                                                                    { '\'', "#39" },
                                                                };
        /// <summary>
        /// HTML-Encodes the provided value
        /// </summary>
        /// <param name="value">object to encode</param>
        /// <returns>An HTML-encoded string representing the provided value.</returns>
        public static string Encode(object value)
        {
            if (value == null)
                return string.Empty;
    
            // If value is bare HTML, we expect it to be encoded already
            if (value is IHtmlString)
                return value.ToString();
    
            string toEncode = value.ToString();
    
            // Init capacity to length of string to encode
            var builder = new StringBuilder(toEncode.Length);
    
            foreach (char c in toEncode)
            {
                string result;
                bool success = toEscape.TryGetValue(c, out result);
    
                string character = success
                                    ? "&" + result + ";"
                                    : c.ToString();
    
                builder.Append(character);
            }
    
            return builder.ToString();
        }
    }
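
As one possible adaptation, the same five escapes could also be written as a switch that appends directly to the StringBuilder, which avoids creating a temporary string per character. This is only a sketch: the class name SimpleHtmlEncoder is hypothetical, and it assumes plain string input, so the IHtmlString pass-through from the version above is left out.

```csharp
using System;
using System.Text;

public static class SimpleHtmlEncoder
{
    // Hypothetical variant: same five escapes as above, string input only.
    public static string Encode(string value)
    {
        if (value == null)
            return string.Empty;

        var builder = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '<': builder.Append("&lt;"); break;
                case '>': builder.Append("&gt;"); break;
                case '"': builder.Append("&quot;"); break;
                case '&': builder.Append("&amp;"); break;
                case '\'': builder.Append("&#39;"); break;
                default: builder.Append(c); break; // everything else, including ü, is kept as-is
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // The markup characters are escaped, the ü is passed through unchanged.
        string encoded = Encode("<strong>ümlaut</strong>");
        Console.WriteLine(encoded == "&lt;strong&gt;ümlaut&lt;/strong&gt;"); // True
    }
}
```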
    

Based on Thomas's answer, with some improvements to the handling of spaces, tabs and new lines, since they can break the structure of the HTML:

    public static string HtmlEncode(string value, bool removeNewLineAndTabs)
    {
        if (value == null)
            return string.Empty;

        // Init capacity to length of string to encode
        var builder = new StringBuilder(value.Length);

        foreach (char c in value)
        {
            string result;
            bool success = toEscape.TryGetValue(c, out result);

            // Assumes toEscape is the dictionary from Thomas's answer, which stores
            // bare entity names such as "lt", so wrap them in '&' and ';' here
            string character = success ? "&" + result + ";" : c.ToString();

            builder.Append(character);
        }

        string retVal = builder.ToString();

        if (removeNewLineAndTabs)
        {
            // Replace line breaks and tabs with a single space each
            retVal = retVal.Replace("\r\n", " ");
            retVal = retVal.Replace("\r", " ");
            retVal = retVal.Replace("\n", " ");
            retVal = retVal.Replace("\t", " ");
        }
        return retVal;
    }
    