How to extract quotations from text using NLTK

This question already has an answer here:

  • RegEx: Grabbing values between quotation marks (19 answers)

  • As Mayur mentioned, you can use a regex to pick up everything between quotation marks:

    import re
    quotes = re.findall(r'".*?"', string)
    

    The problem you'll run into is that a surprising number of things between quotation marks are not actually quotations.

    If you're working with academic articles, you can look for a number after the closing quotation mark to pick up the footnote number. For non-academic articles, you could instead try something like:

    r'(said|writes|argues|concludes),? ".*?"'
    

    This can be more precise, but it risks missing quotes such as blockquotes (blockquotes will cause you problems anyway, because they can include a newline before the closing quotation mark, and . does not match a newline by default).
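
    The two heuristics above can be sketched roughly as follows. This is only an illustration: the sample text, the names FOOTNOTED, ATTRIBUTED and find_quotes, and the use of re.DOTALL (so that . also matches a newline inside a multi-line quote) are assumptions, not part of the original answer.

```python
import re

# Quote immediately followed by a footnote number (academic style).
FOOTNOTED = re.compile(r'"([^"]*)"\s?(\d+)')

# Quote introduced by a reporting verb; DOTALL lets the quote span newlines.
ATTRIBUTED = re.compile(r'\b(?:said|writes|argues|concludes),? "(.*?)"', re.DOTALL)


def find_quotes(text):
    """Return (footnoted, attributed) quotes found in text."""
    footnoted = [(m.group(1), int(m.group(2))) for m in FOOTNOTED.finditer(text)]
    attributed = ATTRIBUTED.findall(text)
    return footnoted, attributed


# Made-up sample: one attributed multi-line quote, one scare quote,
# and one quote followed by a footnote number.
sample = (
    'Smith argues, "the data are\nincomplete" and moves on. '
    'The so-called "gold standard" was dropped. '
    'Jones notes that "results vary widely"12 across sites.'
)
footnoted, attributed = find_quotes(sample)
print(footnoted)
print(attributed)
```

    In this sample the scare-quoted "gold standard" is picked up by neither pattern, which is the point of the extra context, but any real quotation that is not introduced by one of the listed verbs or followed by a footnote number would be missed as well.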

    As for using NLTK, I can't think of anything there that would be of much help, other than perhaps WordNet for finding synonyms for "said".


    This qualifies as a pattern, i.e., the data you are looking for is always between quotation marks (" "). Simply put, you can use a regex for pattern matching. Take this example:

    she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"

    A regex that works for this basic example is:

    import re
    quotes = re.findall(r'".*?"', string)
    

    This gives us ['" DAS A SDASD sdasdasd SADSD"', '" SA23 DSD "', '"ASDAS1 3123$ %$%"']

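    Here, .*? matches any characters except a newline, and does so lazily (as few as possible), so each match stops at the first closing quotation mark. The two " characters in the pattern match the opening and closing quotation marks literally.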
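
    As a quick check, the example can be run end to end. The sketch below uses the sample string from this answer; the variable names, the capture-group variant and the greedy comparison are additions for illustration only.

```python
import re

# Sample string from the answer above.
text = 'she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"'

# Lazy match: an opening mark, as few characters as possible, a closing mark.
with_marks = re.findall(r'".*?"', text)

# A capture group returns only the text between the quotation marks.
without_marks = re.findall(r'"(.*?)"', text)

# Greedy match for comparison: runs from the first mark to the last one.
greedy = re.findall(r'".*"', text)

print(with_marks)     # three separate quoted chunks
print(without_marks)  # same chunks without the surrounding marks
print(len(greedy))    # 1
```

    Dropping the ? makes the match greedy, so everything from the first quotation mark to the last comes back as a single string, which is why the lazy form is used here.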

    Be aware that quotation marks nested within quotation marks break this code: you will not get the expected output.
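
    A small sketch of that failure mode, using a made-up sentence (the text and variable names are hypothetical):

```python
import re

# Hypothetical sentence with one quotation nested inside another.
text = 'He said "she called it "fine" yesterday" and left.'

# The lazy pattern pairs each opening mark with the very next mark,
# so the outer quotation is split into two wrong fragments.
fragments = re.findall(r'".*?"', text)
print(fragments)  # ['"she called it "', '" yesterday"']
```

    Because straight quotation marks look the same when opening and closing, a plain regex has no way to tell which marks belong together here.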
