How to Detect and Resolve Incorrectly Encoded Varchar Data?

2018-06-21 14:57:43

My company has a CRM product that is built on top of a third party webmail system. We use their underlying database, and have extended it with additional databases of our own. As well as using our product, clients are able to log into the webmail system directly.

The webmail databases use the SQL_Latin1_General_CP1_CI_AS collation, and contact names are stored in varchar columns, not nvarchar.

Both our product and the webmail product serve pages with Content-Type: text/html; charset=utf-8

If a client creates a contact in webmail (the 3rd party system) with the first name "Céline" it ends up stored in the database as "CÃ©line". This is because webmail seems to first convert the data from utf-8 to latin-1 before storing it in the database. The utf-8 char 'é' is stored as two bytes, which in latin-1 are interpreted as the two characters: "Ã©"

However, when the data is retrieved and displayed in webmail, it displays correctly as 'Céline'

The problem is: When reading/writing to contacts from our CRM system, if you set the first name to 'Céline' it is stored as 'Céline', instead of being converted first to latin-1 'CÃ©line'

Vice versa, if you create Céline in webmail, it displays in our CRM product as CÃ©line because it's not being converted from latin-1 to utf-8.

Our product has french internationalization and has been in production for quite a few months, so there is quite a bit of data in the system with both methods of encoding.

I can convert from latin-1 to utf-8 using:

var bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(Convert.ToString(obj));
string fix2 = Encoding.UTF8.GetString(bytes).Trim(); // from iso-8859-1 (latin-1) to utf-8

But that only works if the data was correctly converted to latin-1 before being stored. So what I really need is a way to determine if the data in the record is a utf-8 encoded string or a latin-1 encoded string.

Or, moving forward, I need a way to mimic what webmail is doing, and make all write operations to the database first convert from utf-8 to latin-1, and all read operations convert from latin-1 to utf-8.

Any ideas? Please let me know if you need additional information/clarification.

Some clarifications. There is a difference between converting a byte stream between character encodings (this will modify the bytes) and interpreting a byte stream using different character encodings (this will not modify the bytes, just display them differently). Your webmail application does not convert the UTF-8 characters on the way to the database, but rather (incorrectly) reinterprets the byte stream.
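To make the distinction concrete, here is a minimal sketch that compares the bytes produced by each approach. The class and method names are hypothetical and chosen only for illustration; the only APIs used are Encoding.Convert, Encoding.GetBytes, Encoding.GetString and BitConverter.ToString.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ConvertVersusReinterpret
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Converting re-encodes the same characters, so the bytes change.
    public static string ConvertedHex(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] latin1 = Encoding.Convert(Encoding.UTF8, Latin1, utf8);
        return BitConverter.ToString(latin1);
    }

    // Reinterpreting keeps the UTF-8 bytes and reads each byte as one
    // Latin-1 character, so the bytes stay the same but the text changes.
    public static string ReinterpretedHex(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        string misread = Latin1.GetString(utf8);
        return BitConverter.ToString(Latin1.GetBytes(misread));
    }
}
```

For the single character é (U+00E9), ConvertedHex returns "E9" while ReinterpretedHex returns "C3-A9": the second case is what ends up in the webmail database.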

Is it possible to detect the incorrectly encoded characters?

In theory, no. The characters, interpreted as ISO-8859-1, are perfectly valid. In practice, you could hand-craft a search for uncommon characters such as Ã in your example to find the inconsistencies.
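One practical heuristic, offered as a sketch rather than a guaranteed test: take the stored string, turn it back into bytes as ISO-8859-1, and check whether those bytes form valid UTF-8 containing at least one non-ASCII sequence. Correctly stored Latin-1 text such as 'Céline' almost never passes that check, while misread text such as 'CÃ©line' does. The class and method names below are hypothetical, and a genuine Latin-1 string that happens to look like valid UTF-8 would be a false positive.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDetector
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // throwOnInvalidBytes: true makes decoding throw instead of
    // silently substituting U+FFFD for malformed sequences.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the value looks like UTF-8 bytes that were read
    // as Latin-1; 'repaired' then holds the reinterpreted text.
    public static bool LooksMisread(string value, out string repaired)
    {
        repaired = value;
        if (string.IsNullOrEmpty(value)) return false;

        bool hasNonAscii = false;
        foreach (char c in value)
        {
            if (c > 0xFF) return false;      // cannot come from a single Latin-1 byte
            if (c > 0x7F) hasNonAscii = true;
        }
        if (!hasNonAscii) return false;      // pure ASCII is identical in both encodings

        try
        {
            repaired = StrictUtf8.GetString(Latin1.GetBytes(value));
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;                    // not valid UTF-8, so treat as real Latin-1
        }
    }
}
```

Because the check can be wrong in rare cases, it is probably safer to run it as a report over the existing contacts first and review the matches before updating any rows.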

I need a way to mimic what webmail is doing

To reinterpret a string in C# from UTF-8 to ISO-8859-1 you can use the following line (remember to perform the opposite on the way back from the database):

Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes("Some text"))
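Both directions could be wrapped in a small pair of helpers so that every read and write in the CRM goes through the same code path. This is only a sketch: the class and method names are hypothetical, and it assumes the stored text round-trips through the varchar column unchanged. That assumption may not hold for every character, since the CP1 collation uses code page 1252, which differs from ISO-8859-1 in the 0x80 to 0x9F range.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class WebmailText
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Before writing: read the UTF-8 bytes of the text as Latin-1 characters,
    // which reproduces the form webmail stores ("Céline" becomes "CÃ©line").
    public static string ToStoredForm(string text) =>
        Latin1.GetString(Encoding.UTF8.GetBytes(text));

    // After reading: turn the Latin-1 characters back into bytes and decode
    // them as UTF-8 ("CÃ©line" becomes "Céline").
    public static string FromStoredForm(string stored) =>
        Encoding.UTF8.GetString(Latin1.GetBytes(stored));
}
```

FromStoredForm should only be applied to values that were written in the webmail form; applying it to a name that the CRM already stored as plain 'Céline' would corrupt it.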
