Increasing efficiency when processing large files
I'm new to Python and currently using Python 2. I have some source files, each of which contains a huge amount of data (roughly 19 million lines). It looks like the following:
apple	N	apple
n&apos
garden	N	garden
btamd
great	Adj	great
nice	Adj	(unknown)
etc
My task is to search the 3rd column of each file for certain target words, and every time a target word is found in the corpus, the 10 words before and after it have to be added to a multidimensional dictionary.
EDIT: lines containing '&', "'" or the string '(unknown)' should be excluded.
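To illustrate, the multidimensional dictionary my code below ends up building has roughly this shape (the words and counts here are made up):

targets = {
    'apple': {              # target lemma (3rd column)
        'N': {              # POS tag of the target (2nd column)
            'garden': {     # lemma of a word within the +/-10 window
                'N': 3      # POS of that word -> co-occurrence count
            }
        }
    }
}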
I tried to solve this with readlines() and enumerate(), as you can see in the code below. The code does what it should, but it is obviously not efficient enough for the amount of data in the source files.
I know that readlines() or read() should not be used for huge data sets, since they load the whole file into memory. Still, when reading the file line by line, I did not manage to use the enumerate approach to get the 10 words before and after a target word. I also cannot use mmap, since I don't have permission to use it on those files.
So I think the readlines method with some size limit would be the most efficient solution. However, wouldn't I then run into errors every time the end of a size-limited chunk is reached, because the code would simply break off and the 10 words after a target word near the boundary would not be captured?
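To make the windowing concrete, something like a fixed-size collections.deque would keep the 10 preceding lines while reading line by line (this is only a sketch of the idea, not my actual code; the file name is made up):

from collections import deque

window = deque(maxlen=10)          # holds at most the 10 most recent lines
with open('corpus.txt') as f:      # hypothetical file name
    for line in f:
        # at this point `window` contains up to 10 lines preceding `line`
        window.append(line.strip())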
import os
import re
import gzip
import csv

def get_target_to_dict(file):
    targets_dict = {}
    with open(file) as f:
        for line in f:
            targets_dict[line.strip()] = {}
    return targets_dict

targets_dict = get_target_to_dict('targets_uniq.txt')

# browse directory and process each file
# find the target words and add the 10 words before and after to the dictionary
# exclude lines starting with <, -, ; to just have raw text
def get_co_occurence(path_file_dir, targets, results):
    lines = []
    for file in os.listdir(path_file_dir):
        if file.startswith('corpus'):
            path_file = os.path.join(path_file_dir, file)
            with gzip.open(path_file) as corpusfile:
                # PROBLEMATIC CODE HERE
                # lines = corpusfile.readlines()
                for line in corpusfile:
                    if re.match('[A-Z]|[a-z]', line):
                        if '(unknown)' in line:
                            continue
                        elif "'" in line:
                            continue
                        elif '&' in line:
                            continue
                        lines.append(line)
                for i, line in enumerate(lines):
                    line = line.strip()
                    if re.match('[A-Z]|[a-z]', line):
                        parts = line.split('\t')
                        lemma = parts[2]
                        if lemma in targets:
                            pos = parts[1]
                            if pos not in targets[lemma]:
                                targets[lemma][pos] = {}
                            counts = targets[lemma][pos]
                            context = []
                            # look at the 10 previous lines
                            for j in range(max(0, i - 10), i):
                                context.append(lines[j])
                            # look at the next 10 lines
                            for j in range(i + 1, min(i + 11, len(lines))):
                                context.append(lines[j])
                            # END OF PROBLEMATIC CODE
                            for context_line in context:
                                context_line = context_line.strip()
                                parts_context = context_line.split('\t')
                                context_lemma = parts_context[2]
                                if context_lemma not in counts:
                                    counts[context_lemma] = {}
                                context_pos = parts_context[1]
                                if context_pos not in counts[context_lemma]:
                                    counts[context_lemma][context_pos] = 0
                                counts[context_lemma][context_pos] += 1
    csvwriter = csv.writer(results, delimiter='\t')
    for k, v in targets.iteritems():
        for k2, v2 in v.iteritems():
            for k3, v3 in v2.iteritems():
                for k4, v4 in v3.iteritems():
                    csvwriter.writerow([str(k), str(k2), str(k3), str(k4), str(v4)])
                    # print(str(k) + "\t" + str(k2) + "\t" + str(k3) + "\t" + str(k4) + "\t" + str(v4))

results = open('results_corpus.csv', 'wb')
word_occurrence = get_co_occurence(path_file_dir, targets_dict, results)
For the sake of completeness I copied the whole code section, since it is all part of one function that builds the multidimensional dictionary out of the extracted information and then writes it to a csv file.
I would really appreciate any hints or suggestions for making this code more efficient.
EDIT: I corrected the code so that it takes exactly the 10 words before and after the target word into account.
My idea is to create one buffer that stores the 10 lines before and another buffer that stores the 10 lines after. As the file is being read, each line is pushed into the before-buffer, and the buffer pops off its oldest line once its size exceeds 10.
For the after-buffer, I clone a second iterator from the file iterator with itertools.tee. The two iterators then run in parallel inside the loop, with the cloned iterator running 10 iterations ahead to collect the 10 lines after the current one.
This avoids using readlines() and loading the whole file into memory. Hope it works for your actual case.
EDIT: only fill the before/after buffers if the 3rd column does not contain any of '&', "'", '(unknown)'. Also changed split('\t') to plain split() so that it handles both spaces and tabs.
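In case itertools.tee is unfamiliar, here is a toy illustration (made-up data) of the cloning used below: advancing the clone does not affect the main iterator.

import itertools

stream = iter(['a', 'b', 'c', 'd'])
main, lookahead = itertools.tee(stream)

next(lookahead)              # advance only the clone
print(next(main))            # 'a' -- the main iterator is unaffected
print(next(lookahead))       # 'b'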
import os
import re
import itertools

def get_co_occurence(path_file_dir, targets, results):
    excluded_words = ['&', "'", '(unknown)']  # modify excluded words here
    for file in os.listdir(path_file_dir):
        if file.startswith('testset'):
            path_file = os.path.join(path_file_dir, file)
            with open(path_file) as corpusfile:
                # CHANGED CODE HERE
                before_buf = []  # buffer to store the 10 lines before
                after_buf = []   # buffer to store the 10 lines after
                corpusfile, corpusfile_clone = itertools.tee(corpusfile)  # clone the file iterator to access the next 10 lines
                for line in corpusfile:
                    line = line.strip()
                    if re.match('[A-Z]|[a-z]', line):
                        parts = line.split()
                        lemma = parts[2]

                        # before-buffer handling: only fill the buffer if the line contains no excluded word
                        if not any(w in line for w in excluded_words):
                            before_buf.append(line)  # append to the before buffer
                            if len(before_buf) > 11:
                                before_buf.pop(0)    # keep the buffer at size 10 (plus the current line)

                        # after-buffer handling
                        while len(after_buf) <= 10:
                            try:
                                after = next(corpusfile_clone)  # advance the cloned iterator by 1
                                after_lemma = ''
                                after_tmp = after.split()
                                if re.match('[A-Z]|[a-z]', after) and len(after_tmp) > 2:
                                    after_lemma = after_tmp[2]
                            except StopIteration:
                                break  # the cloned iterator exhausts first because it runs 10 iterations ahead
                            if after_lemma and not any(w in after for w in excluded_words):
                                after_buf.append(after)  # append to the after buffer
                                # print 'after', after, ' - ', after_lemma
                        if after_buf and line in after_buf[0]:
                            after_buf.pop(0)  # pop one off, ready for the next line

                        if lemma in targets:
                            pos = parts[1]
                            if pos not in targets[lemma]:
                                targets[lemma][pos] = {}
                            counts = targets[lemma][pos]
                            # context = []
                            # look at the 10 previous lines
                            context = before_buf[:-1]  # minus the current line
                            # look at the next 10 lines
                            context.extend(after_buf)

                            # END OF CHANGED CODE
                            # CONTINUE YOUR STUFF HERE WITH CONTEXT
Here is a functional alternative written in Python 3.5. I simplified your example to take only 5 words on each side. There are other simplifications regarding junk-value filtering, but they only require minor modifications. I'll use the package fn from PyPI to make this functional code read more naturally.
from typing import List, Tuple
from itertools import groupby, filterfalse
from fn import F
First, we need to extract the column:
def getcol3(line: str) -> str:
    return line.split("\t")[2]
Then we need to split the lines into blocks separated by a predicate:
TARGET_WORDS = {"target1", "target2"}

# this is our predicate
def istarget(word: str) -> bool:
    return word in TARGET_WORDS
Let's filter out the junk and write a function that takes the first and last 5 words:
def isjunk(word: str) -> bool:
    return word == "(unknown)"

def first_and_last(words: List[str]) -> Tuple[List[str], List[str]]:
    first = words[:5]
    last = words[-5:]
    return first, last
Now, let's look at the groups:
words = (F() >> (map, str.strip) >> (filter, bool) >> (map, getcol3) >> (filterfalse, isjunk))(lines)
groups = groupby(words, istarget)
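To see what the grouping does, here is a toy run with made-up words: groupby splits the word stream into alternating runs of non-target and target words.

from itertools import groupby

stream = ['word', 'word', 'target1', 'word', 'word']
for is_tgt, run in groupby(stream, istarget):
    print(is_tgt, list(run))
# False ['word', 'word']
# True ['target1']
# False ['word', 'word']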
Now, process the groups:
def is_target_group(group: Tuple[str, List[str]]) -> bool:
    return istarget(group[0])

def unpack_word_group(group: Tuple[str, List[str]]) -> List[str]:
    return [*group[1]]

def unpack_target_group(group: Tuple[str, List[str]]) -> List[str]:
    return [group[0]]

def process_group(group: Tuple[str, List[str]]):
    return (unpack_target_group(group) if is_target_group(group)
            else first_and_last(unpack_word_group(group)))
The final step is:
words = list(map(process_group, groups))
PS
Here is my test case:
from io import StringIO

buffer = """
_\t_\tword
_\t_\tword
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\ttarget1
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\tword
_\t_\ttarget2
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\ttarget1
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\tword
"""

# this simulates an opened file
lines = StringIO(buffer)
Given this file, you will get this output:
[(['word', 'word', 'word', 'word', 'word'],
['word', 'word', 'word', 'word', 'word']),
(['target1'], ['target1']),
(['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word']),
(['target2'], ['target2']),
(['word', 'word', 'word', 'word', 'word'],
['word', 'word', 'word', 'word', 'word']),
(['target1'], ['target1']),
(['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word'])]
From here you can simply drop the first 5 and the last 5 words.