How to extract quotations from text using NLTK
As Mayur mentioned, you can use a regex to pick up everything between quotation marks:
import re
quotes = re.findall(r'".*?"', string)
The problem you'll run into is that a surprisingly large number of things between quotation marks are not actually quotations (titles, scare quotes, terms being defined, and so on).
If you're working with academic articles, you can look for a number after the closing quotation mark to pick up the footnote number. With non-academic articles, you could instead run something like:
r'(said|writes|argues|concludes)(,)? ".*?"'
This is more precise, but it risks missing quotations that have no attribution verb in front of them, such as blockquotes. (Blockquotes will cause you problems anyway, because they can include a newline before the closing quotation mark.)
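A minimal sketch of both heuristics, in case it helps. The function names and sample strings are made up for illustration, and the footnote pattern assumes the footnote number sits directly after the closing quotation mark, which will not hold for every citation style. re.DOTALL is used so that a quotation containing a newline is still matched.

```python
import re

# Attribution verb, optional comma, a space, then the quoted text.
# re.DOTALL lets "." match newlines, so multi-line quotations are kept.
ATTRIBUTED = re.compile(r'(said|writes|argues|concludes),? "(.*?)"', re.DOTALL)

# Assumption: the footnote number follows the closing quote immediately.
FOOTNOTED = re.compile(r'"([^"]*)"(\d+)')

def attributed_quotes(text):
    """Return (verb, quotation) pairs introduced by an attribution verb."""
    return [(m.group(1), m.group(2)) for m in ATTRIBUTED.finditer(text)]

def footnoted_quotes(text):
    """Return (quotation, footnote number) pairs."""
    return [(m.group(1), int(m.group(2))) for m in FOOTNOTED.finditer(text)]

news = 'The "so-called" expert said, "It works." Smith argues "it does\nnot".'
paper = 'As Jones puts it, "the data are clear"12 and "maybe" not.'

print(attributed_quotes(news))
print(footnoted_quotes(paper))
```

Note that the scare-quoted "so-called" and the unfootnoted "maybe" are both skipped, which is the point of the extra context in each pattern.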
As for using NLTK, I can't think of anything there that will be of much help, other than perhaps WordNet for finding synonyms for "said".
This qualifies as a pattern, i.e., the data you are looking for is always between quotation marks (""). Simply put, you can use a regex for pattern matching. Let's take this example:
she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"
The regex that works for this basic example is:
quotes = re.findall(r'".*?"', string)
quotes then gives us ['" DAS A SDASD sdasdasd SADSD"', '" SA23 DSD "', '"ASDAS1 3123$ %$%"']
Here, .*? matches any character (except newline) as few times as possible, and the pattern matches the opening and closing quotation marks (") literally.
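To see why the ? after .* matters, here is a small runnable comparison on the example string. The variable names are just for illustration.

```python
import re

text = 'she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"'

# Non-greedy: stop at the first closing quote after each opening quote.
lazy = re.findall(r'".*?"', text)

# Greedy: run from the first quote all the way to the last one on the line.
greedy = re.findall(r'".*"', text)

print(lazy)    # three separate quoted strings
print(greedy)  # a single match spanning all of them
```

The greedy version swallows everything between the first and last quotation mark, so the non-greedy form is almost always what you want here.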
Beware that quotation marks nested within quotation marks break this code: you will not get the expected output.
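The snippet below first shows the failure on a made-up sentence, then sketches one possible workaround. The workaround is not a regex: it is a hypothetical helper, top_level_quotes, that counts nesting depth, and it only works under the assumption that the text uses distinct opening and closing marks (typographic “ and ”). With straight quotes there is no way to tell an opening mark from a closing one, so this does not help there.

```python
import re

# Straight quotes nested in straight quotes: the match is split in the wrong place.
nested = 'He said "she told me "go away" yesterday"'
broken = re.findall(r'".*?"', nested)
print(broken)

def top_level_quotes(text, open_q='“', close_q='”'):
    """Return outermost quotations, assuming distinct open/close marks."""
    quotes, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == open_q:
            if depth == 0:
                start = i          # remember where the outermost quote begins
            depth += 1
        elif ch == close_q and depth > 0:
            depth -= 1
            if depth == 0:         # outermost quote just closed
                quotes.append(text[start:i + 1])
    return quotes

curly = 'He said “she told me “go away” yesterday” and left.'
print(top_level_quotes(curly))
```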