How to extract relationships from text in NLTK

2018-06-03 05:59:20

Hi, I'm trying to extract relationships from a string of text, based on the second-to-last example here: https://web.archive.org/web/20120907184244/http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html

From a string such as "Michael James editor of Publishers Weekly", I would like output such as:

[PER: 'Michael James'] ', editor of' [ORG: 'Publishers Weekly']

What is the best way to do this? What format does extract_rels expect, and how do I format my input to meet that requirement?

I tried to do it myself, but it didn't work. Here is the code I adapted from the book. No results are printed. What am I doing wrong?

class doc():
    pass

doc.headline = ['this is expected by nltk.sem.extract_rels but not used in this script']

def findrelations(text):
    roles = """
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)
    tokenizedsentences = nltk.sent_tokenize(text)
    for sentence in tokenizedsentences:
        taggedwords = nltk.pos_tag(nltk.word_tokenize(sentence))
        doc.text = nltk.batch_ne_chunk(taggedwords)
        print doc.text
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
            print relextract.show_raw_rtuple(rel) # doctest: +ELLIPSIS

text = "Michael James editor of Publishers Weekly"

findrelations(text)

Here is code based on yours (with just a few adjustments) that works well ;)

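import re

import nltk
from nltk.chunk import ne_chunk_sents
from nltk.sem import relextract

# The NLTK tokenizer, tagger, and NE chunker data must already be
# installed (see nltk.download()).


def findrelations(text):
    # Raw string so the \s escapes reach the regex engine unchanged.
    roles = r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)

    # Sentence split -> word tokenize -> POS tag -> named-entity chunk
    sentences = nltk.sent_tokenize(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = ne_chunk_sents(tagged_sentences)

    for doc in chunked_sentences:
        print(doc)
        # With corpus='ace', extract_rels takes the chunk tree itself,
        # so no wrapper object with .text and .headline is needed.
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ace', pattern=ROLES):
            # The chunked sentence is a tree, so you need to work on it
            # to output what you want.
            # rtuple is the NLTK 3 name; older releases called it show_raw_rtuple.
            print(relextract.rtuple(rel))


findrelations('Michael James editor of Publishers Weekly')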

(S (PERSON Michael/NNP) (PERSON James/NNP) editor/NN of/IN (ORGANIZATION Publishers/NNS Weekly/NNP))
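
To get from there to the output format asked for in the question, something like the following might help. It is only a sketch and does not call NLTK at all: it assumes that extract_rels tests the pattern against the POS-tagged text between two entities (for example 'editor/NN of/IN'), and that each relation it returns is a dict with the keys subjclass, subjtext, filler, objclass and objtext. The helper names untag and format_relation, and the hand-written example dict, are made up for illustration.

```python
import re

# Role pattern from the answer, with the whitespace escapes written out.
ROLES = re.compile(r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """, re.VERBOSE)


def untag(tagged_text):
    """Turn 'editor/NN of/IN' into 'editor of' by dropping the POS tags."""
    return " ".join(token.rsplit("/", 1)[0] for token in tagged_text.split())


def format_relation(rel):
    """Format a relation dict as [PER: '...'] '...' [ORG: '...']."""
    return "[%s: %r] %r [%s: %r]" % (
        rel["subjclass"][:3],   # PERSON -> PER, ORGANIZATION -> ORG
        untag(rel["subjtext"]),
        untag(rel["filler"]),
        rel["objclass"][:3],
        untag(rel["objtext"]),
    )


# Hypothetical, hand-written relation dict (assumed shape, not NLTK output).
example = {
    "subjclass": "PERSON",
    "subjtext": "Michael/NNP James/NNP",
    "filler": "editor/NN of/IN",
    "objclass": "ORGANIZATION",
    "objtext": "Publishers/NNS Weekly/NNP",
}

# Only report the relation if the text between the entities names a role.
if ROLES.match(example["filler"]):
    print(format_relation(example))
```

Note that without the backslashes the last alternative would be read as the literal text ",sofsth...", so the "X, of (the) Y" case would never match.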
