Causal Sentence Extraction Using NLTK (Python)

I am extracting causal sentences from water-related accident reports, using NLTK as the tool. I manually wrote a regexp chunk grammar based on 20 causal sentence structures (see the examples below). The resulting grammar looks like this:

grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
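
For readers who want to try the grammar, here is a minimal sketch of how it could be applied with `nltk.RegexpParser`. The helper name `has_cause_chunk` and the two hand-tagged sentences are assumptions made for illustration (the tags were assigned by hand, not produced by a tagger), so this only shows the mechanics, not the original test setup.

```python
import nltk

# Grammar copied from the question; "Cause" is the chunk label it defines.
grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
chunker = nltk.RegexpParser(grammar)


def has_cause_chunk(tagged_sentence):
    """Return True if the grammar finds at least one Cause chunk.

    tagged_sentence is a list of (word, POS tag) pairs.
    (Hypothetical helper, not part of NLTK.)
    """
    tree = chunker.parse(tagged_sentence)
    return any(subtree.label() == "Cause" for subtree in tree.subtrees())


# Hand-assigned Penn Treebank tags (an assumption; a real pipeline would
# use a tagger such as nltk.pos_tag, which needs extra model data).
causal = [("She", "PRP"), ("had", "VBD"), ("health", "NN"), ("problems", "NNS"),
          ("because", "IN"), ("of", "IN"), ("poor", "JJ"), ("sanitation", "NN"),
          ("in", "IN"), ("the", "DT"), ("village", "NN"), (".", ".")]
non_causal = [("The", "DT"), ("boat", "NN"), ("left", "VBD"), ("the", "DT"),
              ("harbour", "NN"), ("at", "IN"), ("noon", "NN"), (".", ".")]

print(has_cause_chunk(causal))
print(has_cause_chunk(non_causal))
```

With these tags, both sentences should produce a Cause chunk, even though the second one is not causal. The pattern accepts almost any subject followed by a past-tense verb and at least two more tokens, which is consistent with high recall and low precision.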

This grammar has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences), but its precision is low. I would like to ask:

  • How can I train NLTK to build the regexp grammar automatically for extracting a particular type of sentence?
  • Has anyone tried to extract causal sentences before? Example causal sentences are:

  • There was poor sanitation in the village, as a consequence, she had health problems.

  • The water was impure in her village, for this reason, she suffered from parasites.

  • She had health problems because of poor sanitation in the village.

I want to extract only sentences of the above type from a large text.


  • I had a brief discussion with Mr. Jacob Perkins, the author of the book "Python Text Processing with NLTK 2.0 Cookbook". He said, "a generalized grammar for sentences is pretty hard. I would instead see if you can find common tag patterns, and use those. But then you're essentially doing classification by regexp matching. Parsing is usually used to extract phrases within a sentence, or to produce deep parse trees of a sentence, but you're just trying to identify/extract sentences, which is why I think classification is a much better approach. Consider including tagged words as features when you try this, since the grammar could be significant." Following his suggestions, I looked at the causal sentences I had and found that they contain connecting words and phrases such as:

    consequently
    as a result
    Therefore
    as a consequence
    For this reason
    For all these reasons
    Thus
    because
    since
    because of
    on account of
    due to
    for the reason
    so, that
    

    These words connect the cause and the effect in a sentence. Using these connectors, it becomes straightforward to pick out candidate causal sentences.
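
The connector-based approach can be sketched with the standard library alone. This is only an illustration under some assumptions: the function names are made up, sentence splitting is a naive regex rather than a real tokenizer, and "so, that" is left out because its intended form is unclear from the list.

```python
import re

# Connectors taken from the list above ("so, that" omitted as ambiguous).
CONNECTORS = [
    "consequently", "as a result", "therefore", "as a consequence",
    "for this reason", "for all these reasons", "thus", "because",
    "since", "because of", "on account of", "due to", "for the reason",
]

# Longest phrases first so multi-word connectors are tried before short ones.
CONNECTOR_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(c) for c in sorted(CONNECTORS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_causal_sentences(text):
    """Keep only the sentences that contain at least one connector."""
    return [s for s in split_sentences(text) if CONNECTOR_RE.search(s)]


sample = (
    "There was poor sanitation in the village, as a consequence, "
    "she had health problems. "
    "The boat left the harbour at noon. "
    "She had health problems because of poor sanitation in the village."
)
for sentence in extract_causal_sentences(sample):
    print(sentence)
```

On this sample, the first and third sentences are kept and the second is dropped. Words such as "since" and "thus" also have non-causal uses (for example, temporal "since"), so a keyword filter like this may still return false positives and is probably best treated as a first pass or as a feature for a classifier.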
