Causal Sentence Extraction Using NLTK (Python)
I am extracting causal sentences from accident reports on water, using NLTK as the tool. I manually created my regexp grammar from 20 causal sentence structures (see examples below). The constructed grammar is of the type:
grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
The grammar has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences) but low precision. Has anyone tried to extract causal sentences before? Example causal sentences are:
There was poor sanitation in the village, as a consequence, she had health problems.
The water was impure in her village, For this reason, she suffered from parasites.
She had health problems because of poor sanitation in the village.
I want to extract only sentences of the above type from a large text.
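For reference, the recall and precision figures mentioned above can be computed from per-sentence labels along the following lines. This is only a minimal sketch: the function name `precision_recall` and the six toy labels are made up for illustration and are not the actual 50/50 dataset or its results.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for boolean 'is causal' labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # causal, flagged causal
    fp = sum(1 for p, g in pairs if p and not g)    # not causal, flagged causal
    fn = sum(1 for p, g in pairs if not p and g)    # causal, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 3 causal and 3 non-causal sentences (illustration only).
gold = [True, True, True, False, False, False]
# A grammar that matches too eagerly: finds every causal sentence,
# but also flags two of the non-causal ones.
predicted = [True, True, True, True, True, False]

precision, recall = precision_recall(predicted, gold)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

A pattern that matches almost everything will always reach full recall, so low precision here suggests the grammar is too permissive rather than that it captures causality.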
I had a brief discussion with Jacob Perkins, the author of the book "Python Text Processing with NLTK 2.0 Cookbook". He said:
"A generalized grammar for sentences is pretty hard. I would instead see if you can find common tag patterns, and use those. But then you're essentially doing classification by regexp matching. Parsing is usually used to extract phrases within a sentence, or to produce deep parse trees of a sentence, but you're just trying to identify/extract sentences, which is why I think classification is a much better approach. Consider including tagged words as features when you try this, since the grammar could be significant."
Taking his suggestions, I looked at the causal sentences I had and found that they contain words and phrases like:
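As a rough illustration of the "tagged words as features" suggestion, a feature extractor for a sentence classifier might look like the sketch below. The function name `sentence_features` and the feature naming scheme are assumptions made for this example, and the POS tags are hand-assigned here; in practice they would come from a tagger such as `nltk.pos_tag`.

```python
def sentence_features(tagged):
    """Turn a POS-tagged sentence [(word, tag), ...] into a feature dict.

    The dict-of-features shape is the one NLTK classifiers accept;
    the feature names themselves are hypothetical.
    """
    features = {}
    # One feature per (lowercased word, tag) pair: the "tagged words".
    for word, tag in tagged:
        features["word_tag(%s/%s)" % (word.lower(), tag)] = True
    # Adjacent tag pairs capture a little of the local grammar.
    tags = [tag for _, tag in tagged]
    for first, second in zip(tags, tags[1:]):
        features["tag_bigram(%s %s)" % (first, second)] = True
    return features


# Hand-tagged example sentence (tags are assumed, not tagger output).
tagged_sentence = [
    ("She", "PRP"), ("had", "VBD"), ("health", "NN"), ("problems", "NNS"),
    ("because", "IN"), ("of", "IN"), ("poor", "JJ"), ("sanitation", "NN"),
]
features = sentence_features(tagged_sentence)
print(sorted(features))
```

A list of `(features, label)` pairs built this way could then be passed to something like `nltk.NaiveBayesClassifier.train`, with the labels taken from the hand-built causal / non-causal dataset.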
consequently
as a result
Therefore
as a consequence
For this reason
For all these reasons
Thus
because
since
because of
on account of
due to
for the reason
so, that
These words connect cause and effect in a sentence. Using these connectors, it is now easy to extract causal sentences.
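A minimal sketch of connector-based extraction, using only the standard `re` module, could look like this. The names `CONNECTORS`, `split_sentences`, `find_connector` and `extract_causal` are made up for the example. The sentence splitter is deliberately naive (a real pipeline would more likely use `nltk.sent_tokenize`), and "so, that" is left out on the assumption that it stands for the discontinuous "so ... that" pattern, which a plain phrase match does not handle.

```python
import re

# Connectors from the list above, lowercased. Longer phrases come first so
# that "because of" is reported in preference to the shorter "because".
CONNECTORS = [
    "for all these reasons", "as a consequence", "for this reason",
    "for the reason", "on account of", "consequently", "as a result",
    "because of", "therefore", "because", "due to", "since", "thus",
]

# \b keeps the match on word boundaries, so "thus" will not fire inside
# a longer word.
CONNECTOR_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONNECTORS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_connector(sentence):
    """Return the first causal connector in the sentence, or None."""
    match = CONNECTOR_RE.search(sentence)
    return match.group(1).lower() if match else None


def extract_causal(text):
    """Return (sentence, connector) pairs for sentences with a connector."""
    results = []
    for sentence in split_sentences(text):
        connector = find_connector(sentence)
        if connector is not None:
            results.append((sentence, connector))
    return results


text = (
    "There was poor sanitation in the village, as a consequence, "
    "she had health problems. The boat left at noon. "
    "She had health problems because of poor sanitation in the village."
)
causal = extract_causal(text)
for sentence, connector in causal:
    print("[%s] %s" % (connector, sentence))
```

Some connectors are ambiguous ("since" is often temporal, "thus" can mean "in this way"), so a filter like this is probably better treated as a high-recall first pass than as a final answer.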