How to efficiently check neighbor elements for characteristic

I have an example text here (Lorem ipsum); let's say I am looking for "car" with a "diesel" engine. The real text is about 11,000 words. I am using Python 3 and looked into NLTK, but didn't find the right approach.

Example text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor my old car has a nice diesel engine sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Question:

How would I do this efficiently? Can you suggest some text-mining algorithms for further research, for example if I want to search for more than one keyword?

Update1 Begin

I want to find the distance (in words) between two words in a text. In my example the distance is 4 (3 words between "car" and "diesel").

Update1 End

My idea so far is to iterate over the list of words and check whether the current word is "car"; if so, I check whether any of the 5 words before or the 5 words after it is "diesel". In my real code I do some fuzzy matching, so you can ignore special cases like "car." with trailing punctuation.

near = 5

# flatten the text into a single list of words
words = []
for line in text.splitlines():
    words.extend(line.split())

for idx, val in enumerate(words):
    if val == 'car':  # compare values with ==, not "is"
        print(idx, val)
        print("near words")
        for x in range(1, near + 1):
            if idx - x >= 0:  # avoid wrapping around to the end of the list
                print(words[idx - x])
                # check for diesel
        print("after")
        for x in range(1, near + 1):
            if idx + x < len(words):  # avoid IndexError at the end
                print(words[idx + x])
                # check for diesel

from collections import defaultdict
from nltk import word_tokenize
import math

text = """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor my old car has a nice diesel engine sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."""

word_index = defaultdict(list)

for i,word in enumerate(word_tokenize(text)):
    word_index[word].append(i)

# First occurrence of 'car'
car = word_index['car'][0]
# Indices of 'diesel'
diesel = word_index['diesel']

nearest_diesel = min(diesel, key=lambda x:abs(x-car))

distance = math.fabs(car - nearest_diesel)

print(distance)
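The inverted-index idea above generalizes to more than one keyword: build the index once, then compute the minimum token distance for every keyword pair. This is a sketch, not a definitive implementation; it uses a simple whitespace split instead of NLTK's `word_tokenize` to stay self-contained, and `min_distance` is a hypothetical helper name:

```python
from collections import defaultdict
from itertools import combinations

def min_distance(text, keywords):
    """Smallest token distance between occurrences of each keyword pair."""
    # build the inverted index: word -> list of positions
    index = defaultdict(list)
    for i, word in enumerate(text.split()):
        index[word].append(i)

    # for each pair of keywords, take the minimum over all occurrence pairs
    result = {}
    for a, b in combinations(keywords, 2):
        result[(a, b)] = min(abs(i - j) for i in index[a] for j in index[b])
    return result

print(min_distance("my old car has a nice diesel engine", ["car", "diesel"]))
```

For long texts with many occurrences, the nested minimum can be sped up by keeping each position list sorted and merging the two lists in one pass, but the brute-force version is usually fine at 11,000 words.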