How to extract relationships from text in NLTK
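Hi, I am trying to extract relationships from a string of text, based on the second-to-last example here: https://web.archive.org/web/20120907184244/http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html
From a string such as "Michael James editor of Publishers Weekly", the result I want is output like this:
[PER: 'Michael James'] 'editor of' [ORG: 'Publishers Weekly']
What is the best way to do this? What format does extract_rels expect, and how do I format my input to meet that requirement?
I tried to do it myself, but it did not work. Here is the code I adapted from the book. Nothing is printed for the relations. What am I doing wrong?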
import re

import nltk
from nltk.sem import relextract


class doc():
    pass


doc.headline = ['this is expected by nltk.sem.extract_rels but not used in this script']


def findrelations(text):
    roles = r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)
    tokenizedsentences = nltk.sent_tokenize(text)
    for sentence in tokenizedsentences:
        taggedwords = nltk.pos_tag(nltk.word_tokenize(sentence))
        doc.text = nltk.batch_ne_chunk(taggedwords)
        print(doc.text)
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
            print(relextract.show_raw_rtuple(rel))  # doctest: +ELLIPSIS


text = "Michael James editor of Publishers Weekly"
findrelations(text)
Here is a version based on your code (with only a few tweaks) that works fine ;)
import nltk
import re
from nltk.sem import relextract


def findrelations(text):
    roles = r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)

    # Split into sentences, then tokenize, POS-tag and NE-chunk each one.
    sentences = nltk.sent_tokenize(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences)

    for doc in chunked_sentences:
        print(doc)
        # With corpus='ace', extract_rels takes the chunk tree directly,
        # so no wrapper object with .text and .headline is needed.
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ace', pattern=ROLES):
            # doc is a chunk tree; each rel is a dict you can post-process
            # to output what you want.
            # rtuple was named show_raw_rtuple in NLTK 2.x.
            print(relextract.rtuple(rel))


findrelations('Michael James editor of Publishers Weekly')
(S (PERSON Michael/NNP) (PERSON James/NNP) editor/NN of/IN (ORGANIZATION Publishers/NNS Weekly/NNP))
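To get from there to something closer to the output asked for in the question, the relation's text fields need their POS tags removed. Here is a minimal sketch of that post-processing using only the standard library. It assumes each relation is a dict with the keys `subjclass`, `subjtext`, `filler`, `objclass` and `objtext`, holding `word/TAG` strings; the `sample_rel` dict is hand-written for illustration and is not captured output, and `strip_tags` and `format_relation` are hypothetical helper names. It also shows the filter side: as far as I can tell, `extract_rels` applies the pattern with `match`, so the pattern has to match from the start of the filler text.

```python
import re

# Same verbose pattern as in the code above.
ROLES = re.compile(r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """, re.VERBOSE)


def strip_tags(tagged):
    """Turn 'Publishers/NNS Weekly/NNP' into 'Publishers Weekly'."""
    return " ".join(token.rsplit("/", 1)[0] for token in tagged.split())


def format_relation(rel):
    """Render a relation dict as [SUBJ: '...'] 'filler' [OBJ: '...'] without POS tags."""
    return "[%s: %r] %r [%s: %r]" % (
        rel["subjclass"][:3],
        strip_tags(rel["subjtext"]),
        strip_tags(rel["filler"]),
        rel["objclass"][:3],
        strip_tags(rel["objtext"]),
    )


# Hand-written stand-in for one relation dict (illustrative values only).
sample_rel = {
    "subjclass": "PERSON",
    "subjtext": "James/NNP",
    "filler": "editor/NN of/IN",
    "objclass": "ORGANIZATION",
    "objtext": "Publishers/NNS Weekly/NNP",
}

# The filler between the two entities is what the pattern is tested against.
if ROLES.match(sample_rel["filler"]):
    print(format_relation(sample_rel))
```

Note that the tree above puts Michael and James in separate PERSON chunks, so getting the full name 'Michael James' as the subject would need an extra step that merges adjacent PERSON chunks before extracting relations.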