Optimizing iteration over a list of lists in Python

2018-06-13 21:48:58

I have an optimization problem (I am fairly new to Python and Stack Overflow).

I am building a word collocation network for a research project. The code I wrote takes a stemmed text without stop words (text_c) and splits it into sentences. For each sentence, it then iterates over the terms in order to build a weighted semantic network that I will then process with NetworkX. This is partly based on a dictionary of the form {'word': digit} (the dic below). For every pair of terms, the code iterates over the list of existing edges in the network (each edge is a list of 3 items).

The problem might be that the loop over the network keeps growing: each time a new edge/list is added, the scan gets longer, so the total work grows roughly quadratically with the number of edges. There are about 110K sentences in the text, so this is taking way too much time (the last run took 4 hours and didn't finish). There must be a better way of doing this. Would a 'for' statement be more efficient than the while loop? How would this work?

Thanks!

#determine semantic networks    
outfile = open("00_network_"+str(c)+".csv","a")
network = []
er=0
data = text_c.split(".")
for lines in data:
    linew = lines.split()
    ran = len(linew)
    if ran>3: #sentences of more than three words
        i=0
        while i < ran:
            j = i+1
            while j < ran:
                try:
                    previous_edge = []
                    for n in network:
                        if n[0] == dic[linew[i]] and n[1] == dic[linew[j]]:
                            previous_edge = [n[0],n[1],n[2]]

                    if previous_edge == []:
                        new_edge = [dic[linew[i]],dic[linew[j]],1/((j-i))]
                        network.append(new_edge)
                    else:
                        new_edge = [dic[linew[i]],dic[linew[j]],previous_edge[2]+1/((j-i))]
                        network.remove([previous_edge[0],previous_edge[1],previous_edge[2]])
                        network.append(new_edge)
                except KeyError:
                    er=er+1

                j=j+1
            i=i+1

i and j are only incremented inside the loops, never otherwise manipulated.

Use for and range.

dic[linew[i]] and dic[linew[j]] are compared inside the loop over the network, so both values are looked up again on every iteration.

Cache the values outside of the loop.

You probably want a break once you have found previous_edge, which saves you (many) unneeded iterations.

Don't test for equality against an empty list. An empty list is falsy, so not thislist is enough to know whether the list is empty.

Don't recreate previous_edge from its 3 values just to remove it from the network.

# determine semantic networks
outfile = open("00_network_" + str(c) + ".csv", "a")
network = []
er = 0
data = text_c.split(".")
for lines in data:
    linew = lines.split()
    ran = len(linew)
    if ran > 3:  # sentences of more than three words
        # use for and ranges
        for i in range(ran):
            try:
                dli = dic[linew[i]]  # cache the lookup for word i
            except KeyError:
                # word i is unknown: every pair (i, j) would have failed,
                # so count them all, as the original loop did
                er = er + (ran - i - 1)
                continue
            for j in range(i + 1, ran):  # j starts after i, so j - i >= 1
                try:
                    previous_edge = []
                    # cache dictionary access before going into for n loop
                    dlj = dic[linew[j]]
                    for n in network:
                        if n[0] == dli and n[1] == dlj:
                            previous_edge = n
                            # stop scanning once the edge is found
                            break

                    if not previous_edge:  # an empty list is falsy
                        new_edge = [dli, dlj, 1/(j-i)]
                        network.append(new_edge)
                    else:
                        new_edge = [dli, dlj, previous_edge[2] + 1/(j-i)]
                        # remove the found edge directly; no need to rebuild it
                        network.remove(previous_edge)
                        network.append(new_edge)
                except KeyError:
                    er = er + 1
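
The changes above trim the constant factors, but every pair of words still triggers a scan over the whole network. A different technique, not part of the answer above, is to keep the weights in a dictionary keyed by the (source, target) pair, so finding an existing edge is a single lookup. The sketch below is only an illustration under some assumptions: build_network, weights and skipped are made-up names, dic is assumed to map every known word to a non-None value, and 1.0 is used in the division so the weight is a float on Python 2 as well.

```python
from collections import defaultdict


def build_network(text, dic):
    """Accumulate edge weights for a period-separated text.

    Hypothetical helper: returns (edges, skipped), where edges maps
    (id_a, id_b) to the summed weight and skipped counts the word pairs
    dropped because one of the words is missing from dic.
    """
    weights = defaultdict(float)
    skipped = 0
    for sentence in text.split("."):
        words = sentence.split()
        if len(words) <= 3:  # keep only sentences of more than three words
            continue
        # Look every word up once; None marks a word missing from dic.
        ids = [dic.get(w) for w in words]
        for i, a in enumerate(ids):
            for j in range(i + 1, len(ids)):
                b = ids[j]
                if a is None or b is None:
                    skipped += 1
                    continue
                # Dictionary lookup replaces the scan over all edges.
                weights[(a, b)] += 1.0 / (j - i)
    return dict(weights), skipped
```

If the list-of-triples form is still needed afterwards, it can be rebuilt in one pass with [(a, b, w) for (a, b), w in edges.items()], which is also the shape that NetworkX's add_weighted_edges_from accepts.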
