How to detect Stream encoding?
Based on this question: How can I get HTML source code from TWebBrowser
If I run this code with a html page that has Unicode code page, the result is gibberish becouse TStringStream is not Unicode in D7. the page might be UTF8 encoded or other (Ansi) code page encoded.
How can I detect if a TStream/IPersistStreamInit is Unicode/UTF8/Ansi?
How do I always return correct result as WideString for this function?
function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
If I replace TStringStream with TMemoryStream, and save TMemoryStream to file it's all good. It can be either Unicode/UTF8/Ansi. but I always want to return the stream back as WideString:
function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
var
// LStream: TStringStream;
LStream: TMemoryStream;
Stream : IStream;
LPersistStreamInit : IPersistStreamInit;
begin
if not Assigned(WebBrowser.Document) then exit;
// LStream := TStringStream.Create('');
LStream := TMemoryStream.Create;
try
LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
Stream := TStreamAdapter.Create(LStream,soReference);
LPersistStreamInit.Save(Stream,true);
// result := LStream.DataString;
LStream.SaveToFile('c:testtest.txt'); // test only - file is ok
Result := ??? // WideString
finally
LStream.Free();
end;
end;
EDIT: I found this article - How to load and save documents in TWebBrowser in a Delphi-like way
Which does exactlly what I need. but it works correctlly only with Delphi Unicode compilers (D2009+). read Conclusion section:
There is obviously a lot more we could do. A couple of things immediately spring to mind. We retro-fit some of the Unicode functionality and support for non-ANSI encodings to the pre-Unicode compiler code. The present code when compiled with anything earlier than Delphi 2009 will not save document content to strings correctly if the document character set is not ANSI.
The magic is obviously in TEncoding
class ( TEncoding.GetBufferEncoding
). but D7 does not have TEncoding
. Any ideas?
我使用GpTextStream来处理转换(应该适用于所有的Delphi版本):
function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
const
WIN_CHARSET = 'windows-';
ISO_CHARSET = 'iso-';
var
S: string;
begin
Result := 0;
if Charset = 'unicode' then
Result := CP_UNICODE else
if Charset = 'utf-8' then
Result := CP_UTF8 else
if Pos(WIN_CHARSET, Charset) <> 0 then
begin
S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
Result := StrToIntDef(S, 0);
end else
if Pos(ISO_CHARSET, Charset) <> 0 then // ISO-8859 (e.g. iso-8859-1: => 28591)
begin
S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint);
S := Copy(S, Pos('-', S) + 1, 2);
if S = '15' then // ISO-8859-15 (Latin 9)
Result := 28605
else
Result := StrToIntDef('2859' + S, 0);
end;
end;
function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
var
LStream: TMemoryStream;
Stream: IStream;
LPersistStreamInit: IPersistStreamInit;
TextStream: TGpTextStream;
Charset: WideString;
Buf: WideString;
CodePage: Word;
N: Integer;
begin
Result := '';
if not Assigned(WebBrowser.Document) then Exit;
LStream := TMemoryStream.Create;
try
LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
Stream := TStreamAdapter.Create(LStream, soReference);
if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
Charset := (WebBrowser.Document as IHTMLDocument2).charset;
CodePage := GetCodePageFromHTMLCharSet(Charset);
N := LStream.Size;
SetLength(Buf, N);
TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
try
N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
SetLength(Buf, N);
Result := Buf;
finally
TextStream.Free;
end;
finally
LStream.Free();
end;
end;
链接地址: http://www.djcxy.com/p/91056.html
上一篇: 八进制常量像十进制常量一样工作
下一篇: 如何检测流编码?