How to extract quotations from text using NLTK

2018-06-23 05:44:50

This question already has an answer here:

RegEx: Grabbing values between quotation marks 19 answers

As Mayur mentioned, you can do a regex to pick up everything between quotes

import re
quotes = re.findall(r'".*?"', text)

The problem you'll run into is that a surprisingly large number of things appear between quotation marks without actually being quotations.

If you're working with academic articles, you can look for a number after the closing quotation mark to pick up the footnote number. For non-academic articles, you could try something like:

r'(said|writes|argues|concludes),? ".*?"'

This can be more precise, but it risks missing quotes such as blockquotes (blockquotes will cause you problems anyway, because they can contain a newline before the closing quotation mark, and . does not match a newline by default).
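Both heuristics above can be sketched roughly as follows. This is only an illustration under some assumptions: the text uses straight double quotes, a footnote marker is a bare number right after the closing quote, and the names ATTRIBUTION_VERBS, attributed_quotes and footnoted_quotes are made up for this example. It uses [^"] instead of . so that a quote spanning a newline is still picked up.

```python
import re

# Hypothetical verb list; extend it for your own corpus.
ATTRIBUTION_VERBS = ("said", "writes", "argues", "concludes")

# An attribution verb, an optional comma, whitespace, then a quoted span.
# [^"] also matches newlines, so multi-line quotes are not lost.
ATTRIBUTED = re.compile(
    r'\b(?:' + "|".join(ATTRIBUTION_VERBS) + r'),?\s+"([^"]+)"'
)

# A quoted span followed directly by a footnote number (assumed format).
FOOTNOTED = re.compile(r'"([^"]+)"\s?(\d+)')


def attributed_quotes(text):
    """Return the quoted text that follows an attribution verb."""
    return ATTRIBUTED.findall(text)


def footnoted_quotes(text):
    """Return (quote, footnote_number) pairs."""
    return [(quote, int(number)) for quote, number in FOOTNOTED.findall(text)]
```

Neither pattern is reliable on its own: the first misses quotes introduced by verbs outside the list, and the second will also pick up a quoted word that merely happens to be followed by a number.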

As for using NLTK, I can't think of anything there that will be of much help, other than perhaps WordNet for finding synonyms for "said".

This qualifies as a pattern, i.e., the data you are looking for is always between quotation marks (""). Simply put, you can use a regex for pattern matching. Let's take this example:

she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"

A regex that works for this basic example is:

import re
quotes = re.findall(r'".*?"', text)

With the example string in text, quotes is ['" DAS A SDASD sdasdasd SADSD"', '" SA23 DSD "', '"ASDAS1 3123$ %$%"']

Here, .*? matches any character except a newline, as few times as possible, so each match stops at the nearest closing quotation mark. The quotation marks at the beginning and end of the pattern are matched literally.
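To see why the ? matters, here is a small comparison on the example string from this answer. The variable names lazy, greedy and inner are just for illustration.

```python
import re

text = 'she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"'

# Lazy: each match stops at the nearest closing quotation mark.
lazy = re.findall(r'".*?"', text)

# Greedy: a single match runs from the first quotation mark to the last one.
greedy = re.findall(r'".*"', text)

# A capture group returns the contents without the quotation marks.
inner = re.findall(r'"(.*?)"', text)

print(len(lazy), len(greedy))
print(inner)
```

The lazy pattern finds three separate spans, while the greedy one collapses them into a single match covering everything between the first and last quotation mark.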

Be aware that quotation marks nested within quotation marks break this pattern: the match ends at the first inner quotation mark, so you will not get the expected output.
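Two cases of nesting can still be handled with a regex, and the sketch below covers them under stated assumptions: either the inner straight quotes are escaped with a backslash, or the text uses typographic quotes (“ and ”), which have distinct opening and closing characters. Unescaped straight quotes nested inside straight quotes are ambiguous and are not recovered here. The name extract_quotes is hypothetical.

```python
import re

# Straight quotes where an inner quote is written as \" (an assumption
# about the input). A backslash followed by any character is consumed
# as a unit, so an escaped quote does not end the match.
ESCAPED = re.compile(r'"((?:\\.|[^"\\])*)"', re.DOTALL)

# Typographic quotes: the opening and closing marks differ, so a span
# can be matched without confusing one for the other.
CURLY = re.compile(r'“([^“”]*)”')


def extract_quotes(text):
    """Return quoted spans without their marks.

    Straight-quote matches come first, then typographic ones, so the
    result is not necessarily in document order.
    """
    return ESCAPED.findall(text) + CURLY.findall(text)
```

For example, extract_quotes(r'He wrote "she said \"no\" twice" here') returns the whole outer quotation as one item, with the backslashes still in place.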
