Determine the encoding of text in Python

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? There is a similar question, How can I detect the encoding/codepage of a text file, that deals with C#.


Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses that kind of statistical study to try to detect the encoding. It is a port of the auto-detection code in Mozilla.
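A minimal sketch of using chardet, assuming it is installed (`pip install chardet`); the sample text is my own:

```python
import chardet

# A byte string whose encoding we pretend not to know
# (here: UTF-8 encoded French, so it contains multi-byte sequences).
raw = "café crème, s'il vous plaît".encode('utf-8')

guess = chardet.detect(raw)  # dict with 'encoding' and a 'confidence' score
text = raw.decode(guess['encoding'])
```

Note that the result is a guess: always check the `confidence` value before trusting the `encoding` it reports.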

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

  • Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.

    The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. You can determine the encoding of a file by doing:

    import magic
    
    blob = open('unknown-file', 'rb').read()
    m = magic.open(magic.MAGIC_MIME_ENCODING)
    m.load()
    encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc
    

    There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:

    import magic
    
    blob = open('unknown-file', 'rb').read()
    m = magic.Magic(mime_encoding=True)
    encoding = m.from_buffer(blob)
    

    Some encoding strategies from the shell; uncomment to taste:

    #!/bin/bash
    #
    tmpfile=$1
    echo '-- info about the file ........'
    file -i "$tmpfile"
    enca -g "$tmpfile"
    echo 'recoding ........'
    #iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
    #enca -x utf-8 "$tmpfile"
    #enca -g "$tmpfile"
    recode CP1250..UTF-8 "$tmpfile"
    

    You might like to check the encoding by opening and reading the file in a loop over candidate encodings... but you might want to check the file size first:

    import codecs

    encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more candidates as needed
    for e in encodings:
        try:
            fh = codecs.open('file.txt', 'r', encoding=e)
            fh.readlines()
            fh.seek(0)
        except UnicodeDecodeError:
            print('got unicode error with %s , trying different encoding' % e)
        else:
            print('opening the file with encoding:  %s ' % e)
            break
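The same idea works in plain Python 3 without the codecs module; `guess_open` is a name I made up for this sketch:

```python
encodings = ['utf-8', 'windows-1250', 'windows-1252']  # candidates, most likely first

def guess_open(path, candidates=encodings):
    """Return the first candidate encoding that decodes the whole file, or None."""
    for e in candidates:
        try:
            with open(path, encoding=e) as fh:
                fh.read()  # force a full decode of the file
            return e
        except UnicodeDecodeError:
            continue
    return None
```

Keep in mind this only finds an encoding that decodes without errors, not necessarily the right one: many single-byte encodings will happily decode each other's bytes into the wrong characters.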
    