Detect encoding and make everything UTF
I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, eg UTF-8 and ISO-8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1) The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ß". Then it is displayed wrongly, of course.
3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb-detect-encoding()? Can I write a function for this? So my problems are: 1) How to find out what encoding the text uses 2) How to convert it to UTF-8 - whatever the old encoding is
EDIT: Would a function like this work?
function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}
I've tested it but it doesn't work. What's wrong with it?
If you apply utf8_encode()
to an already UTF8 string it will return a garbled UTF8 output.
I made a function that addresses all this issues. It´s called Encoding::toUTF8()
.
You don't need to know what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8()
will convert everything to UTF8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use ForceUTF8Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
Update:
I've included another function, Encoding::fixUFT8()
, which will fix every UTF8 string that looks garbled.
Usage:
require_once('Encoding.php');
use ForceUTF8Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Update: I've transformed the function ( forceUTF8
) into a family of static functions on a class called Encoding
. The new function is Encoding::toUTF8()
.
You first have to detect what encoding has been used. As you're parsing RSS feeds (probably via HTTP), you should read the encoding from the charset
parameter of the Content-Type
HTTP header field. If it is not present, read the encoding from the encoding
attribute of the XML processing instruction. If that's missing too, use UTF-8 as defined in the specification.
Edit Here is what I probably would do:
I'd use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type
header field that contains the MIME type and (hopefully) the charset
parameter with the encoding/charset too. If not, we'll analyse the XML PI for the presence of the encoding
attribute and get the encoding from there. If that's also missing, the XML specs define to use UTF-8 as encoding.
$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
$accept = array(
'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
'Accept: '.implode(', ', $accept['type']),
'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
// error fetching the response
} else {
$offset = strpos($response, "rnrn");
$header = substr($response, 0, $offset);
if (!$header || !preg_match('/^Content-Type:s+([^;]+)(?:;s*charset=(.*))?/im', $header, $match)) {
// error parsing the response
} else {
if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
// type not accepted
}
$encoding = trim($match[2], '"'');
}
if (!$encoding) {
$body = substr($response, $offset + 4);
if (preg_match('/^<?xmls+version=(?:"[^"]*"|'[^']*')s+encoding=("[^"]*"|'[^']*')/s', $body, $match)) {
$encoding = trim($match[1], '"'');
}
}
if (!$encoding) {
$encoding = 'utf-8';
} else {
if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
// encoding not accepted
}
if ($encoding != 'utf-8') {
$body = mb_convert_encoding($body, 'utf-8', $encoding);
}
}
$simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
if (!$simpleXML) {
// parse error
} else {
echo $simpleXML->asXML();
}
}
Detecting the encoding is hard.
mb_detect_encoding
works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte-sequences are invalid, an therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings, where the same bytes are valid (but different). In these cases, there is no way to determine the encoding; You can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
As long as you only deal with Western European languages, the three major encodings to consider are utf-8
, iso-8859-1
and cp-1252
. Since these are defaults for many platforms, they are also the most likely to be reported wrongly about. Eg. if people use different encodings, they are likely to be frank about it, since else their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still doublecheck that it is indeed valid, using mb_check_encoding
(note that valid is not the same as being - the same input may be valid for many encodings). If it is one of those, you can then use mb_detect_encoding
to distinguish between them. Luckily that is fairly deterministic; You just need to use the proper detect-sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252
.
Once you've detected the encoding you need to convert it to your internal representation ( UTF-8
is the only sane choice). The function utf8_encode
transforms ISO-8859-1
to UTF-8
, so it can only used for that particular input type. For other encodings, use mb_convert_encoding
.
下一篇: 检测编码并使所有的UTF