NLTK extracting terms of chunker parse tree
John Edward Grey started running now that he knows he is fat
She was listening to smack that by that awful singer
I want to extract interesting terms from a sentence. I currently use POS tagging to identify the grammatical type of each token. Then I add each token to a counter (with different weights for nouns, verbs and adjectives).
I now wish to use a chunker for this. I think the leaf nodes of the parse tree hold all the interesting words and phrases. How do I extract the terms from a chunker's output?
In linguistics, the "interesting words" are called open-class words. The task you are referring to is not really a chunking/parsing task: you are looking for some sort of tagger/annotator/labeller that marks each word as "interesting" or not.
Sequence Labelling
If you approach your task as a sequence labelling task, then the sentence "John Edward Grey started running now that he knows he is fat" will be tagged like this:
[('John','B'),('Edward','I'),('Grey','I'),('started','O'),('running','B'),
('now','O'),('that','O'),('he','O'),('knows','O'),('he','O'),
('is','O'),('fat','B')]
So anything tagged with B marks the beginning of an "interesting" chunk, and a word tagged with I continues it. The chunk ends at the next word tagged with O, or at the next B, which closes the previous "interesting" chunk and starts a new one.
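To make the decoding rule concrete, here is a minimal sketch in plain Python that turns a B/I/O-tagged sentence back into phrases. The function name extract_chunks is my own for illustration, not part of NLTK, and the tags are the hand-written ones from the example above.

```python
def extract_chunks(tagged):
    """Collect the B/I runs of a BIO-tagged token list into phrases."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag == 'B':
            # A new chunk starts; close the previous one if it is still open.
            if current:
                chunks.append(' '.join(current))
            current = [word]
        elif tag == 'I' and current:
            # Continue the chunk that is currently open.
            current.append(word)
        else:
            # 'O' (or a stray 'I' with no open chunk) closes any open chunk.
            if current:
                chunks.append(' '.join(current))
            current = []
    # Flush a chunk that runs to the end of the sentence.
    if current:
        chunks.append(' '.join(current))
    return chunks


tagged = [('John', 'B'), ('Edward', 'I'), ('Grey', 'I'), ('started', 'O'),
          ('running', 'B'), ('now', 'O'), ('that', 'O'), ('he', 'O'),
          ('knows', 'O'), ('he', 'O'), ('is', 'O'), ('fat', 'B')]

print(extract_chunks(tagged))
# ['John Edward Grey', 'running', 'fat']
```

The resulting phrases can go straight into the weighted counter from the question in place of single tokens.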
What is interesting or not?
What is interesting or not depends on the ultimate aim of your task. I would have said that "started running" is an "interesting" chunk, because "started" modifies the meaning of "running" to give it a "begin action" modality.
Closed class vs Open class words
If you already have in mind which words are not interesting, then I suggest you build a dictionary of them (the closed-class words) and run a sequence labelling script that picks out every word not in that dictionary.
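As a rough sketch of the dictionary approach, the script below tags every token found in a closed-class set as O and every other token as B or I. The CLOSED_CLASS set here is a tiny hypothetical list made up for this one example (a real list would be far larger), and label_open_class is an illustrative name, not a library function.

```python
# Hypothetical, deliberately tiny closed-class list; a real one would be
# much larger (pronouns, determiners, prepositions, auxiliaries, ...).
CLOSED_CLASS = {'now', 'that', 'he', 'she', 'is', 'was', 'to', 'by', 'the', 'a'}


def label_open_class(tokens, closed_class=CLOSED_CLASS):
    """Tag closed-class tokens as O and runs of other tokens as B, I, I, ..."""
    tagged, inside = [], False
    for token in tokens:
        if token.lower() in closed_class:
            tagged.append((token, 'O'))
            inside = False
        else:
            # The first open-class word after an O starts a chunk.
            tagged.append((token, 'I' if inside else 'B'))
            inside = True
    return tagged


sentence = 'John Edward Grey started running now that he knows he is fat'
print(label_open_class(sentence.split()))
# [('John', 'B'), ('Edward', 'I'), ('Grey', 'I'), ('started', 'I'),
#  ('running', 'I'), ('now', 'O'), ('that', 'O'), ('he', 'O'),
#  ('knows', 'B'), ('he', 'O'), ('is', 'O'), ('fat', 'B')]
```

Note that this differs from the hand-labelled example: adjacent open-class words are merged into one chunk, so "started running" is glued onto the name, and "knows" is tagged as interesting because verbs are open-class. Whether that is what you want depends on your aim; combining the dictionary with POS tags would be one way to split such runs.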
Machine learning Approach
Another approach is to treat this as a machine learning classification task: pre-annotate a sample of data with what is interesting and what is not, identify some classification features, and then train a classifier that automatically tags new data with B, I and O tags.
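The feature step could look something like the sketch below. The feature names and the choice of features (lowercased word, capitalisation, suffix, neighbouring words) are assumptions of mine for illustration, not a recommended set; only the feature extraction is shown, not the training itself.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (illustrative features)."""
    word = tokens[i]
    return {
        'word.lower': word.lower(),
        'is_capitalized': word[:1].isupper(),
        'suffix3': word[-3:].lower(),
        'prev_word': tokens[i - 1].lower() if i > 0 else '<START>',
        'next_word': tokens[i + 1].lower() if i < len(tokens) - 1 else '<END>',
    }


def sentence_to_training_pairs(tagged):
    """Turn one pre-annotated sentence into (features, tag) pairs."""
    tokens = [word for word, _ in tagged]
    return [(token_features(tokens, i), tag)
            for i, (_, tag) in enumerate(tagged)]


# Hand-annotated sample, taken from the tagging example above.
tagged = [('John', 'B'), ('Edward', 'I'), ('Grey', 'I'), ('started', 'O'),
          ('running', 'B'), ('now', 'O'), ('that', 'O'), ('he', 'O'),
          ('knows', 'O'), ('he', 'O'), ('is', 'O'), ('fat', 'B')]

pairs = sentence_to_training_pairs(tagged)
print(pairs[0])
```

A list of (feature dict, label) pairs like this is the input shape that nltk.NaiveBayesClassifier.train accepts, so the pairs from many annotated sentences can be concatenated and passed to it, or to any other classifier that takes dictionaries of features.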