Python subprocess echo a unicode literal
I'm aware that questions like this have been asked before. But I'm not finding a solution.
I want to use a unicode literal, defined in my Python file, with the subprocess module, but I'm not getting the results that I need. For example, the following code
# -*- coding: utf-8 -*-
import sys
import codecs
import subprocess
cmd = ['echo', u'你好']
new_cmd = []
for c in cmd:
    if isinstance(c, unicode):
        c = c.encode('utf-8')
    new_cmd.append(c)
subprocess.call(new_cmd)
prints out
ä½ å¥½
If I change the code to
# -*- coding: utf-8 -*-
import sys
import codecs
import subprocess
cmd = ['echo', u'你好']
new_cmd = []
for c in cmd:
    if isinstance(c, unicode):
        c = c.encode(sys.getfilesystemencoding())
    new_cmd.append(c)
subprocess.call(new_cmd)
I get the following
??
At this stage I can only assume I'm repeatedly making a simple mistake, but I'm having a hard time figuring out what it is. How can I get echo to print out the following when invoked via Python's subprocess?
你好
Edit:
The version of Python is 2.7. I'm running on Windows 8 but I'd like the solution to be platform independent.
Conclusion: Pay attention to character encodings (there are three different character encodings involved here). Use Python 3 if you want portable Unicode support (pass the arguments as Unicode, don't encode them), or make sure that the data can be represented using the current character encodings from the environment (encode using sys.getfilesystemencoding() on Python 2, as you do in the 2nd code example).
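For illustration, a minimal Python 3 sketch of that portable route (the cmd /c indirection is just my example, since echo is a cmd.exe builtin on Windows; whether the characters actually display still depends on the console, as explained below):
# Python 3 sketch: arguments stay Unicode; subprocess uses CreateProcessW on
# Windows, so no manual encoding is needed
import subprocess
import sys

if sys.platform == 'win32':
    subprocess.call(['cmd', '/c', 'echo', '你好'])  # echo is a cmd.exe builtin here
else:
    subprocess.call(['echo', '你好'])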
The first code example is incorrect. The effect is the same as (run it in IDLE -- py -3 -m idlelib):
>>> print(u'你好'.encode('utf-8').decode('mbcs')) #XXX DON'T DO IT!
ä½ å¥½
where the mbcs codec uses your Windows ANSI code page (typically the cp1252 character encoding -- it may be different, e.g., cp1251 on Russian Windows).
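To see which ANSI code page mbcs maps to on your machine, a quick sketch (Windows only; locale.getpreferredencoding() reports the ANSI code page there):
import locale
print(locale.getpreferredencoding())           # e.g. 'cp1252' -- your ANSI code page
print(u'你好'.encode('utf-8').decode('mbcs'))   # ä½ å¥½ -- the same mojibake, don't do it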
Python 2 uses the CreateProcess macro to start a subprocess, which resolves to the ANSI CreateProcessA function there. CreateProcessA interprets the input bytes as being encoded using your Windows ANSI code page. That is unrelated to the Python source code encoding (utf-8 in your case).
It is expected that you get mojibake if you use a wrong encoding.
Your second code example should work if the input characters can be represented using a Windows code page such as cp1252 (to enable encoding from Unicode to bytes), and if echo uses the Unicode API to print to the Windows console, such as WriteConsoleW() (see the Python 3 package win-unicode-console -- it enables print(u'你好') whatever your chcp ("OEM") code page is, as long as the console font supports the characters), or if the characters can be represented using the OEM code page (used by cmd.exe) such as cp437 (run chcp to find out yours). The ?? question marks indicate that 你好 can't be represented using your console encoding.
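A quick way to check (a sketch; cp437 is only an example -- substitute whatever chcp reports on your machine):
print(u'你好'.encode('cp437', errors='replace'))   # '??' -- these bytes are why echo shows ??
try:
    u'你好'.encode('cp437')
except UnicodeEncodeError:
    print('not representable in cp437')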
To support arbitrary Unicode arguments (including characters that can't be represented using either the Windows ("ANSI") or MS-DOS (OEM) code pages), you need the CreateProcessW function (which is what Python 3 uses). See Unicode filenames on Windows with Python & subprocess.Popen().
Your first try was the best.
You actually converted the two unicode characters u'你好' (or u'\u4f60\u597d') to UTF-8, giving b'\xe4\xbd\xa0\xe5\xa5\xbd'.
You can check it in IDLE, which fully supports Unicode and where b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8') gives back 你好. Another way to check is to redirect the script's output to a file and open it with a UTF-8 compatible editor: there again you will see what you want.
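For instance, a small sketch of that check, which behaves the same on Python 2 and 3:
# confirm the bytes really are the correct UTF-8 data (run it in IDLE)
data = u'你好'.encode('utf-8')
print(repr(data))              # the UTF-8 bytes \xe4\xbd\xa0\xe5\xa5\xbd
print(data.decode('utf-8'))    # 你好 -- the data is fine, only the console display is not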
But the problem is that the Windows console does not fully support Unicode: whether your characters display correctly depends on the active code page and on whether the console font has glyphs for them.
If you know a code page that contains glyphs for your characters (I don't), you can try to activate it in the console with chcp and explicitly encode your unicode string to it. But on my French machine, I do not know how to do that ... except by going through a text file!
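If you want to probe for such a code page programmatically, a sketch like this can help (cp936, the Simplified Chinese ANSI code page, is my assumption for one that does contain these glyphs):
for cp in ('cp1252', 'cp437', 'cp936'):
    try:
        u'你好'.encode(cp)
        print(cp + ': can represent the characters')
    except UnicodeEncodeError:
        print(cp + ': cannot represent the characters')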
Since you mentioned ConEmu, I gave it a try ... and it works fine there, with Python 3.4!
chcp 65001
py -3
>>> import subprocess
>>> cmd = ['cmd', '/c', 'echo', u'\u4f60\u597d']
>>> subprocess.call(cmd)
gives:
你好
0
The problem only shows up in plain cmd.exe windows!