Is there an easy way to work around a Delphi utf8
I have discovered (the hard way) that if a file has a valid UTF-8 BOM but contains any invalid UTF8 encodings, and is read by any of the Delphi (2009+) encoding-enabled methods such as LoadFromFile
, then the result is a completely empty file with no error indication. In several of my applications, I would prefer to simply lose a few bad encodings, even if I get no error report in this case either.
Debugging reveals that MultiByteToWideChar
is called twice, first to get the output buffer size, then to do the conversion. But TEncoding.UTF8 contains a private FMBToWCharFlags
value for these calls, and this is initialized with a MB_ERR_INVALID_CHARS
value. So the call to get the charcount returns 0 and the loaded file is completely empty. Calling this API without the flag would 'silently drop illegal code points'.
My question is how best to weave through the nest of classes in the Encoding area to work around the fact that this is a private value (and needs to be, because it is a class var for all threads). I think I could add a custom UTF8 encoding, using the guidance in Marco Cantu's Delphi 2009 book. And it could optionally raise an exception if MultiByteToWideChar
has returned an encoding error, after calling it again without the flag. But that does not solve the problem of how to get my custom encoding used instead of Tencoding.UTF8
.
If I could just set this up as a default for the application at initialization, perhaps by actually modifying the class var for Tencoding.UFT8
, this would probably be sufficient.
Of course, I need a solution without waiting to lodge a QC report asking for a more robust design, getting it accepted, and seeing it changed.
Any ideas would be very welcome. And can someone confirm this is still an issue for XE4, which I have not yet installed?
I ran into the MB_ERR_INVALID_CHARS
issue when I first updated Indy to support TEncoding
, and ended up implementing a custom TEncoding
-derived class for UTF-8 handling to avoid specifying MB_ERR_INVALID_CHARS
. I didn't think to use a class helper.
However, this issue is not just limited to UTF-8. Any decoding failure of any of the TEncoding
classes will result in a blank result, not an exception being raised. Why Embarcadero chose that route, when most of the RTL/VCL uses exceptions instead, is beyond me. Not raising an exception on error caused a fair amount of issues in Indy that had to be worked around.
This can be done pretty simply, at least in Delphi XE5 (have not checked earlier versions). Just instantiate your own TUTF8Encoding
:
procedure LoadInvalidUTF8File(const Filename: string);
var
FEncoding: TUTF8Encoding;
begin
FEncoding := TUTF8Encoding.Create(CP_UTF8, 0, 0);
// Instead of CP_UTF8, MB_ERR_INVALID_CHARS, 0
try
with TStringList.Create do
try
LoadFromFile(Filename, FEncoding);
// ...
finally
Free;
end;
finally
FEncoding.Free;
end;
end;
The only issue here is that the IsSingleByte
property for the newly instantiated TUTF8Encoding
is then incorrectly set to False
, but this property is not currently used anywhere in the Delphi sources.
A partial workaround is to force the UTF8 encoding to suppress MB_ERR_INVALID_CHARS
globally. For me, this avoids the need for raising an exception, because I find it makes MultiByteToWideChar
not quite 'silent': it actually inserts $fffd
characters (Unicode 'replacement character') which I can then find in the cases where this is important. The following code does this:
unit fixutf8;
interface
uses System.Sysutils;
type
TUTF8fixer = class helper for Tmbcsencoding
public
procedure setflag0;
end;
implementation
procedure TUTF8fixer.setflag0;
{$if CompilerVersion = 31}
asm
XOR ECX,ECX
MOV Self.FMBToWCharFlags,ECX
end;
{$else}
begin
Self.FMBToWCharFlags := 0;
end;
{$endif}
procedure initencoding;
begin
(Tencoding.UTF8 as TmbcsEncoding).setflag0;
end;
initialization
initencoding;
end.
A more useful and principled fix would require changing the calls to MultiByteToWideChar
not to use MB_ERR_INVALID_CHARS
, and to make an initial call with this flag so that an exception could be raised after the load is complete, to indicate that characters will have been replaced.
There are relevant QC reports on this issue, including 76571, 79042 and 111980. The first one has been resolved 'as designed'.
(Edited to work with Delphi Berlin)
链接地址: http://www.djcxy.com/p/16170.html上一篇: 在Java和Delphi中比较和比较接口