How do I automatically fix an invalid JSON string?

From the 2gis API I got the following JSON string.

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

But Python doesn't recognize it:

json.loads(firm_str)

Expecting , delimiter: line 1 column 3646 (char 3645)

It looks like a problem with quotes in: "title": "Center "ADVANCE""

How can I fix it automatically in Python?


The answer by @Michael gave me an idea... not a very pretty idea, but it seems to work, at least on your example: Try to parse the JSON string, and if it fails, look for the character where it failed in the exception string and replace that character.

import json
import re

while True:
    try:
        result = json.loads(s)   # try to parse...
        break                    # parsing worked -> exit loop
    except ValueError as e:      # json.JSONDecodeError is a ValueError
        # "Expecting , delimiter: line 34 column 54 (char 1158)"
        # position of unexpected character after '"'
        unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
        # position of unescaped '"' before that
        unesc = s.rfind(r'"', 0, unexp)
        s = s[:unesc] + r'\"' + s[unesc+1:]
        # position of corresponding closing '"' (+2 for inserted '\')
        closg = s.find(r'"', unesc + 2)
        s = s[:closg] + r'\"' + s[closg+1:]
print(result)

You may want to add some additional checks to prevent this from ending in an infinite loop (e.g., allow at most as many iterations as there are characters in the string). Also, this will still not work if an incorrect " is actually followed by a comma, as pointed out by @gnibbler.

Update: This seems to work pretty well now (though still not perfect), even if the unescaped " is followed by a comma, or closing bracket, as in this case it will likely get a complaint about a syntax error after that (expected property name, etc.) and trace back to the last " . It also automatically escapes the corresponding closing " (assuming there is one).
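
For what it's worth, here is a rough sketch of the same idea with the loop bounded, as suggested above. It assumes Python 3.5 or later, where the exception is json.JSONDecodeError and carries the failing index in its pos attribute, so the error message does not need to be parsed. The names repair_quotes and max_attempts are made up for this example.

```python
import json


def repair_quotes(text, max_attempts=None):
    """Escape stray double quotes until text parses, or give up."""
    if max_attempts is None:
        max_attempts = len(text)  # upper bound: one attempt per character
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            # err.pos is the index of the character the parser choked on;
            # the offending quote is the last '"' before it
            unesc = text.rfind('"', 0, err.pos)
            if unesc == -1:
                raise  # no quote to escape, so this is some other error
            text = text[:unesc] + '\\"' + text[unesc + 1:]
            # escape the matching closing quote (+2 skips the inserted '\"')
            closg = text.find('"', unesc + 2)
            if closg == -1:
                raise
            text = text[:closg] + '\\"' + text[closg + 1:]
    raise ValueError("could not repair JSON in %d attempts" % max_attempts)
```

It has the same blind spots as the loop above: it guesses which quotes are the stray ones, so treat the result with some suspicion.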


If this is exactly what the API is returning then there is a problem with their API. This is invalid JSON. Especially around this area:

"ads": {
            "sponsored_article": {
                "title": "Образовательный центр "ADVANCE"", <-- here
                "text": "Бизнес.Риторика.Английский язык.Подготовка к школе.Подготовка к ЕГЭ."
            },
            "warning": null
        }

The double quotes around ADVANCE are not escaped. You can tell by using something like http://jsonlint.com/ to validate it.

The problem is that the " characters are not escaped. If this is what you are getting, the data is bad at the source, and they need to fix it.

Parse error on line 4:
...азовательный центр "ADVANCE"",         
-----------------------^
Expecting '}', ':', ',', ']'

Escaping the inner quotes fixes the problem:

"title": "Образовательный центр \"ADVANCE\"",
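
If you would rather check this locally than paste the response into a web validator, something like the following should do. It is only a sketch: check is a name made up for this example, the two sample strings are shortened versions of the line above, and json.JSONDecodeError requires Python 3.5 or later.

```python
import json

# Shortened samples: the first has unescaped inner quotes, the second escapes them.
bad = '{"title": "Center "ADVANCE""}'
good = '{"title": "Center \\"ADVANCE\\""}'


def check(text):
    """Return None if text is valid JSON, otherwise a short error description."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as err:
        return "%s at line %d column %d" % (err.msg, err.lineno, err.colno)


print(check(bad))   # describes where parsing failed
print(check(good))  # None, the escaped version parses
```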

The only real and definitive solution is for 2gis to fix their API.

In the meantime, the malformed JSON can be repaired by escaping the double quotes inside string values. If every key-value pair is followed by a newline (as seems to be the case in the posted data), the following function will do the job:

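def fixjson(badjson):
    s = badjson
    idx = 0
    while True:
        try:
            # the value text starts right after the '": "' separator
            start = s.index('": "', idx) + 4
        except ValueError:
            return s              # no more string values to process
        # the value ends at the first '",\n' or '"\n' after its start
        ends = [e for e in (s.find('",\n', start), s.find('"\n', start)) if e != -1]
        if not ends:
            return s
        end = min(ends)
        # escape every double quote inside the value
        content = s[start:end].replace('"', '\\"')
        s = s[:start] + content + s[end:]
        idx = start + len(content)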

Please note the assumptions made:

The function attempts to escape double-quote characters inside the string values of key-value pairs.

It is assumed that the text to be escaped begins after the sequence

": "

and ends before the sequence

",\n

or

"\n

Passing the posted JSON to the function results in this returned value

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center \"ADVANCE\"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

Keep in mind you can easily customize the function if your needs are not fully satisfied.
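
As one possible customization, here is a line-by-line variant using a regular expression. It is a sketch under the same assumption (one "key": "value" pair per line), and the names PAIR and fix_json_lines are made up for this example. Unlike the function above, it tolerates different spacing around the colon and leaves quotes that are already escaped alone.

```python
import json
import re

# One "key": "value" pair per line, optionally followed by a comma.
PAIR = re.compile(r'^(\s*"[^"]*"\s*:\s*")(.*)("\s*,?\s*)$')


def fix_json_lines(badjson):
    """Escape unescaped double quotes inside single-line string values."""
    fixed = []
    for line in badjson.split('\n'):
        m = PAIR.match(line)
        if m:
            head, content, tail = m.groups()
            # escape only quotes that are not already preceded by a backslash
            content = re.sub(r'(?<!\\)"', r'\\"', content)
            line = head + content + tail
        fixed.append(line)
    return '\n'.join(fixed)


bad = '{\n    "title": "Center "ADVANCE"",\n    "text": "Business.English."\n}'
print(json.loads(fix_json_lines(bad))["title"])
```

Lines that hold several pairs, or values that span more than one line, will still trip it up.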
