PHP DOMDocument, Unicode problems

2018-06-10 01:18:02

I have a problem with the following code:

$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);
echo $dom->saveHTML();

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><h1>“</h1></body></html>

OK, this works correctly. But if I extract a single node like this:

$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);
$h1 = $dom->getElementsByTagName('h1');
echo $dom->saveHTML($h1->item(0));

it outputs unrecognizable text instead of the quotation mark.

Does anyone know how to solve this?

Your code example works for me; the output is <h1>“</h1>.

&ldquo;    &#8220;    “    Left double quotation mark (U+201C)

The UTF-8 byte sequence of “ is:

0xE2 (226) 0x80 (128) 0x9C (156)
 |          |           `------ Windows-1252: œ
 |          `--- most Windows 125x encodings: €
 `--- ISO 8859-1, 2, 3, 4, 9, 10, 14, 15, 16: â
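
To see the byte breakdown above in practice, here is a minimal sketch that prints the UTF-8 bytes of the character and then deliberately misreads them as Windows-1252. It assumes the mbstring extension is available; the variable names are illustrative only.

```php
<?php
// Decode the numeric entity from the question into its UTF-8 bytes.
$quote = html_entity_decode('&#8220;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex($quote), "\n"; // e2809c

// Simulate a viewer that wrongly treats those three bytes as Windows-1252:
// each byte becomes its own character (assumes mbstring is installed).
$mojibake = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n"; // â€œ
```

If the "unrecognizable text" you see looks like â€œ, the bytes are probably fine and only the encoding used to display them is wrong.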

So where are you viewing that output?

Probably in a browser on Windows? If so, have you tried adding

header('Content-Type: text/html; charset=utf-8');

at the top of your script?
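
Put together with the code from the question, that could look like the sketch below. The only change is the header() call, which has to run before any output is sent; the $fragment variable is just an illustrative name.

```php
<?php
// Declare the response encoding before any output is sent.
header('Content-Type: text/html; charset=utf-8');

$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);

// Serialize only the first <h1> node, as in the question.
$h1 = $dom->getElementsByTagName('h1')->item(0);
$fragment = $dom->saveHTML($h1);
echo $fragment;
```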

See also: Setting the HTTP charset parameter and Checking HTTP Headers.

You need the second argument of the DOMDocument constructor (check out http://nl.php.net/manual/en/domdocument.construct.php):

$dom = new DOMDocument('1.0', 'utf-8');
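
Whether this argument changes what saveHTML() emits after loadHTML() is not established in this thread, so it may help to check the data itself rather than how it is displayed. The sketch below uses the suggested constructor and inspects the bytes of the node's text; strings read through the DOM API are UTF-8, so seeing e2809c means the document holds the right character and any garbling happens at display time. The variable names are illustrative.

```php
<?php
$source = "<html><body><h1>&#8220;</h1></body></html>";

// Constructor with explicit version and encoding, as suggested above.
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($source);

$h1 = $dom->getElementsByTagName('h1')->item(0);

// Inspect the raw bytes of the text content instead of trusting the display.
$hex = bin2hex($h1->textContent);
echo $hex, "\n"; // e2809c

echo $dom->saveHTML($h1), "\n";
```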
