How to decode the filename parameter of Content-Disposition

2018-06-07 05:20:01

A related question provides background on the filename parameter.

I need to write a script to access some files on a web server. The filename contains CJK characters which cannot be encoded in ASCII.

$ curl -I 'http://bj.baidupcs.com/file/f6f258963f3c5daaa154ed441db232e1?xcode=f5a142e99df965f6a3b4c502a3c55a73283ef282da2f5c14&fid=1107408242-250528-2625488475&time=1373046574&sign=FDTAXER-DCb740ccc5511e5e8fedcff06b081203-QSIMrWw%2FICWQuExpdtyijM0vbMM%3D&to=bb&fm=N,Q,U&expires=8h&rt=sh&r=210487178&logid=3893215518&sh=1'
......
Content-Disposition: attachment;filename="【动漫之家汉化组】[最强会长黑神][第192话][黑神目泷依然健在][END].zip"
......

As you can see, cURL shows the filename correctly. Firefox can also figure out the correct filename.

I wrote my script in Python. I tried requests first:

>>> import requests
>>> r=requests.head('http://bj.baidupcs.com/file/f6f258963f3c5daaa154ed441db232e1?xcode=f5a142e99df965f6a3b4c502a3c55a73283ef282da2f5c14&fid=1107408242-250528-2625488475&time=1373046574&sign=FDTAXER-DCb740ccc5511e5e8fedcff06b081203-QSIMrWw%2FICWQuExpdtyijM0vbMM%3D&to=bb&fm=N,Q,U&expires=8h&rt=sh&r=210487178&logid=3893215518&sh=1')
>>> r.headers['content-disposition']
'attachment;filename="ã\x80\x90å\x8a¨æ¼«ä¹\x8bå®¶æ±\x89å\x8c\x96ç»\x84ã\x80\x91[æ\x9c\x80å¼ºä¼\x9aé\x95¿é»\x91ç¥\x9e][ç¬¬192è¯\x9d][é»\x91ç¥\x9eç\x9b®æ³·ä¾\x9dç\x84¶å\x81¥å\x9c¨][END].zip"'

The filename looks like a weird representation of Python bytes. The problem is that this whole thing is already a Python string. I can't think of a way to get the actual bytes to decode.

>>> type(r.headers['content-disposition'])
<class 'str'>

Under the hood, requests relies (via urllib3) on the standard library's http.client. I tried that directly but got the same thing:

>>> import http.client
>>> conn = http.client.HTTPConnection("bj.baidupcs.com")
>>> conn.request('HEAD', '/file/f6f258963f3c5daaa154ed441db232e1?xcode=f5a142e99df965f6a3b4c502a3c55a73283ef282da2f5c14&fid=1107408242-250528-2625488475&time=1373046574&sign=FDTAXER-DCb740ccc5511e5e8fedcff06b081203-QSIMrWw%2FICWQuExpdtyijM0vbMM%3D&to=bb&fm=N,Q,U&expires=8h&rt=sh&r=210487178&logid=3893215518&sh=1')
>>> r=conn.getresponse()
>>> r.getheader('content-disposition')
'attachment;filename="ã\x80\x90å\x8a¨æ¼«ä¹\x8bå®¶æ±\x89å\x8c\x96ç»\x84ã\x80\x91[æ\x9c\x80å¼ºä¼\x9aé\x95¿é»\x91ç¥\x9e][ç¬¬192è¯\x9d][é»\x91ç¥\x9eç\x9b®æ³·ä¾\x9dç\x84¶å\x81¥å\x9c¨][END].zip"'

I'm using Python 3 on Windows.

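It looks like the server sends the filename as raw UTF-8 bytes, and http.client decodes header bytes as ISO-8859-1 (Latin-1), so each byte comes back as a separate character in a Python 3 (Unicode) string. Re-encoding as Latin-1 recovers the original bytes, which you can then decode as UTF-8:

>>> s = 'attachment;filename="ã\x80\x90å\x8a¨æ¼«ä¹\x8bå®¶æ±\x89å\x8c\x96ç»\x84ã\x80\x91[æ\x9c\x80å¼ºä¼\x9aé\x95¿é»\x91ç¥\x9e][ç¬¬192è¯\x9d][é»\x91ç¥\x9eç\x9b®æ³·ä¾\x9dç\x84¶å\x81¥å\x9c¨][END].zip"'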
>>> s = s.encode('latin-1').decode('utf-8')
>>> s
'attachment;filename="【动漫之家汉化组】[最强会长黑神][第192话][黑神目泷依然健在][END].zip"'
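
If you need this in a script rather than at the prompt, the round trip can be wrapped in a small helper. The following is only a sketch: the function names are made up for illustration, the regular expression handles just the simple quoted filename="..." form seen in this header (not the full Content-Disposition grammar), and it assumes the server really sent UTF-8 bytes. If the round trip fails, the value is returned unchanged.

```python
import re


def repair_latin1_mojibake(value: str) -> str:
    """Undo a Latin-1 decode of bytes that were really UTF-8.

    Hypothetical helper name; returns the input unchanged when the
    round trip is not possible.
    """
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: the string has characters outside Latin-1,
        # so it was not produced by a Latin-1 decode.
        # UnicodeDecodeError: the recovered bytes are not valid UTF-8.
        return value


def filename_from_content_disposition(header: str):
    """Return the repaired value of a quoted filename="..." parameter, or None.

    Assumption: the parameter is written as filename="..." without
    escaped quotes inside the value.
    """
    match = re.search(r'filename="([^"]*)"', header)
    if match is None:
        return None
    return repair_latin1_mojibake(match.group(1))
```

With requests, you would pass r.headers['content-disposition'] to filename_from_content_disposition. Headers that are plain ASCII pass through unchanged, because ASCII survives the Latin-1/UTF-8 round trip.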
