How to extract relationships from text in NLTK
Hi, I'm trying to extract relationships from a string of text, based on the second-to-last example here: https://web.archive.org/web/20120907184244/http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html
From a string such as "Michael James editor of Publishers Weekly", I would like to get output such as:
[PER: 'Michael James'] ', editor of' [ORG: 'Publishers Weekly']
What is the best way to do this? What format does extract_rels expect, and how do I format my input to meet that requirement?
I tried to do it myself, but it didn't work. Here is the code I've adapted from the book. I'm not getting any results printed. What am I doing wrong?
    class doc():
        pass

    doc.headline = ['this is expected by nltk.sem.extract_rels but not used in this script']

    def findrelations(text):
        roles = """
        (.*(
        analyst|
        editor|
        librarian).*)|
        researcher|
        spokes(wo)?man|
        writer|
        ,\sof\sthe?\s*  # "X, of (the) Y"
        """
        ROLES = re.compile(roles, re.VERBOSE)
        tokenizedsentences = nltk.sent_tokenize(text)
        for sentence in tokenizedsentences:
            taggedwords = nltk.pos_tag(nltk.word_tokenize(sentence))
            doc.text = nltk.batch_ne_chunk(taggedwords)
            print doc.text
            for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
                print relextract.show_raw_rtuple(rel) # doctest: +ELLIPSIS

    text ="Michael James editor of Publishers Weekly"
    findrelations(text)
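Here is some code based on yours, with just a few adjustments, that works well ;) The main changes are that the sentences are chunked with ne_chunk_sents and each resulting tree is passed directly to extract_rels with corpus='ace' instead of corpus='ieer'.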
    import re

    import nltk
    from nltk.chunk import ne_chunk_sents
    from nltk.sem import relextract

    def findrelations(text):
        # Verbose regex for the text allowed between the two entities.
        # A raw string keeps the \s escapes intact.
        roles = r"""
        (.*(
        analyst|
        editor|
        librarian).*)|
        researcher|
        spokes(wo)?man|
        writer|
        ,\sof\sthe?\s*  # "X, of (the) Y"
        """
        ROLES = re.compile(roles, re.VERBOSE)

        sentences = nltk.sent_tokenize(text)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = ne_chunk_sents(tagged_sentences)

        for doc in chunked_sentences:
            print(doc)
            for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ace', pattern=ROLES):
                # It is a tree, so you need to work on it to output what you want.
                # rtuple is the NLTK 3 name; in NLTK 2 it was show_raw_rtuple.
                print(relextract.rtuple(rel))

    findrelations('Michael James editor of Publishers Weekly')
Output:

    (S (PERSON Michael/NNP) (PERSON James/NNP) editor/NN of/IN (ORGANIZATION Publishers/NNS Weekly/NNP))
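As a rough illustration of what the pattern argument does, here is a small sketch that tests the same ROLES regex against a few made-up filler strings, using only the standard re module. It assumes (this is my reading of relextract, not something confirmed above) that the tokens between two entities are joined as word/TAG and checked with pattern.match, so the regex has to match from the start of that filler. The helper name matches_role and the sample strings are hypothetical.

```python
import re

# Same role pattern as above, as a raw string so the \s escapes survive.
ROLES = re.compile(r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """, re.VERBOSE)


def matches_role(filler):
    """Return True if ROLES matches at the start of the filler text."""
    return ROLES.match(filler) is not None


# Hypothetical fillers in word/TAG form, for illustration only.
for filler in ["editor/NN of/IN", ",/, editor/NN of/IN", "visited/VBD", ""]:
    print(repr(filler), matches_role(filler))
```

A filler containing a role word such as "editor" passes the check, while a filler with no role word (or an empty filler) does not, so a pair of entities with nothing suitable between them would be filtered out.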