Converting a byte array to a string given encoding

2018-06-28 01:47:43

I read from a file to a byte array:

auto text = cast(immutable(ubyte)[]) read("test.txt");

I can get the type of character encoding using the following function:

enum EncodingType {ANSI, UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE}

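EncodingType DetectEncoding(immutable(ubyte)[] data){
  // Anything shorter than two bytes cannot hold a BOM.
  if (data.length < 2) return EncodingType.ANSI;
  switch (data[0]){
    case 0xEF:
      if (data.length >= 3 && data[1] == 0xBB && data[2] == 0xBF){
        return EncodingType.UTF8;
      } break;
    case 0xFE:
      if (data[1] == 0xFF){
        return EncodingType.UTF16BE;
      } break;
    case 0xFF:
      if (data[1] == 0xFE){
        // FF FE 00 00 is the UTF-32LE BOM; plain FF FE is UTF-16LE.
        if (data.length >= 4 && data[2] == 0x00 && data[3] == 0x00){
          return EncodingType.UTF32LE;
        }else{
          return EncodingType.UTF16LE;
        }
      } break; // D does not allow implicit fall-through
    case 0x00:
      if (data.length >= 4 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF){
        return EncodingType.UTF32BE;
      } break;
    default:
      break;
  }
  return EncodingType.ANSI;
}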

I need a function that takes a byte array and returns the text as a UTF-8 string. If the text is already encoded in UTF-8, the conversion is trivial. The same holds for UTF-16 or UTF-32 in the system's native byte order.

string TextDataToString(immutable(ubyte)[] data){
  import std.utf;
  final switch (DetectEncoding(data)){
    case EncodingType.ANSI:
      return null;/*???*/
    case EncodingType.UTF8:
      return cast(string) data[3..$];
    case EncodingType.UTF16LE:
      wstring result;
      version(LittleEndian) { result = cast(wstring) data[2..$]; }
      version(BigEndian) { result = "";/*???*/ }
      return toUTF8(result);
    case EncodingType.UTF16BE:
      return null;/*???*/
    case EncodingType.UTF32LE:
      dstring result;
      version(LittleEndian) { result = cast(dstring) data[4..$]; }
      version(BigEndian) { result = "";/*???*/ }
      return toUTF8(result);
    case EncodingType.UTF32BE:
      return null;/*???*/
  }
}

But I could not figure out how to convert a byte array that holds ANSI-encoded text (for example, windows-1251) or UTF-16/32 text in non-native byte order. I marked the relevant places in the code with /*???*/ .

In the end, the following code should work for a text file in any encoding:

string s = TextDataToString(text);
writeln(s);

Please help!

BOMs are optional. You cannot use them to reliably detect the encoding. Even if there is a BOM, using it to distinguish UTF from code page encodings is problematic, because the BOM byte sequences are usually valid (if nonsensical) text in those encodings, too. E.g., 0xFE 0xFF is "юя" in Windows-1251.

Even if you could tell UTF from code page encodings, you couldn't tell the different code pages from one another. You could analyze the whole text and make guesses, but that's very error-prone and not very practical.

So, I'd advise you to not try to detect the encoding. Instead, require a specific encoding, or add a mechanism to specify it.
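
For the code page case, here is a minimal sketch of what "require a specific encoding" could look like, assuming the caller already knows the file is Windows-1251. The helper name fromWindows1251 is made up for illustration; transcode and Windows1251String come from std.encoding.

```d
import std.encoding : transcode, Windows1251String;

/// Hypothetical helper: the caller states that `data` is Windows-1251,
/// so nothing is guessed from the bytes themselves.
string fromWindows1251(immutable(ubyte)[] data)
{
    string result;
    // Reinterpret the raw bytes as Windows-1251 characters and let
    // std.encoding transcode them to UTF-8.
    transcode(cast(Windows1251String) data, result);
    return result;
}
```

With this, the two bytes 0xFE 0xFF come out as the two Cyrillic letters mentioned above, which is exactly why they cannot be trusted as a UTF-16 BOM. Other code pages would need their own string type from std.encoding, or a lookup table if Phobos does not ship one.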

As for transcoding from a different byte order, here is an example for UTF-16BE:

import std.algorithm: map;
import std.bitmanip: bigEndianToNative;
import std.conv: to;
import std.exception: enforce;
import std.range: chunks;

alias C = wchar;
enforce(data.length % C.sizeof == 0);
auto result = data
    .chunks(C.sizeof)
    .map!(x => bigEndianToNative!C(x[0 .. C.sizeof]))
    .to!string;
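
The same idea can be generalized to the other missing cases (UTF-16LE on a big-endian machine, UTF-32 in either order). The following is only a sketch under a few assumptions: the name decodeUnits is invented here, the BOM has already been sliced off, and a plain loop is used instead of the range pipeline to keep each step visible.

```d
import std.bitmanip : bigEndianToNative, littleEndianToNative;
import std.conv : to;
import std.exception : enforce;
import std.system : Endian;

/// Hypothetical helper: decodes UTF-16 (C = wchar) or UTF-32 (C = dchar)
/// code units stored in the given byte order and returns UTF-8.
/// `data` must not include the BOM.
string decodeUnits(C, Endian order)(const(ubyte)[] data)
if (is(C == wchar) || is(C == dchar))
{
    enforce(data.length % C.sizeof == 0,
        "input ends in the middle of a code unit");

    auto units = new C[data.length / C.sizeof];
    foreach (i, ref unit; units)
    {
        // Copy one code unit into a fixed-size buffer, which is what
        // the std.bitmanip conversion functions expect.
        ubyte[C.sizeof] raw;
        raw[] = data[i * C.sizeof .. (i + 1) * C.sizeof];

        static if (order == Endian.bigEndian)
            unit = bigEndianToNative!C(raw);
        else
            unit = littleEndianToNative!C(raw);
    }
    // Converting wchar[] or dchar[] to string transcodes to UTF-8.
    return units.to!string;
}
```

In the question's TextDataToString, the UTF16BE case could then return decodeUnits!(wchar, Endian.bigEndian)(data[2..$]) and the UTF32BE case decodeUnits!(dchar, Endian.bigEndian)(data[4..$]). Because the byte order is stated explicitly, the version(LittleEndian)/version(BigEndian) branches would no longer be needed.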
