Time performance when generating a very large text file in Python

I need to generate a very large text file. Each line has a simple format:

Seq_num<SPACE>num_val
12343234 759

Let's assume I am going to generate a file with 100 million lines. I tried two approaches and, surprisingly, they give very different time performance.

  • Loop over 100 million iterations. In each iteration I build a short string of seq_num<SPACE>num_val and write it to the file. This approach takes a lot of time.

    ## APPROACH 1
    for seq_id in seq_ids:
        num_val = rand()
        line = str(seq_id) + ' ' + str(num_val) + '\n'
        data_file.write(line)
    
  • Loop over 100 million iterations. In each iteration I build the same short string, but append it to a list. When the loop finishes, I iterate over the list items and write each one to the file. This approach takes far less time.

    ## APPROACH 2
    data_lines = list()
    for seq_id in seq_ids:
        num_val = rand()
        line = str(seq_id) + ' ' + str(num_val) + '\n'
        data_lines.append(line)
    for line in data_lines:
        data_file.write(line)
    
  • Note that:

  • Approach 2 has two loops instead of one.
  • I write to the file in a loop in both approach 1 and approach 2, so that step should cost the same in both.
  • So approach 1 should take less time, not more. Any hints on what I am missing?


    Considering APPROACH 2, I think I can assume you have the data for all the lines (or at least in big chunks) before you need to write it to the file.

    The other answers are great and it was really informative to read them, but both focused on optimizing the file writing or on replacing the first for loop with a list comprehension (which is known to be faster).

    They missed the fact that you are iterating in a for loop to write the file, which is not really necessary.

    Instead, at the cost of more memory usage (affordable in this case, since a 100 million line file would be about 600 MB), you can build just one string in a more efficient way, using the formatting or join features of Python's str, and then write that big string to the file in a single call, relying on a comprehension or generator expression to produce the data to be formatted.

    With loop1 and loop2 from @Tombart's answer, I get elapsed time 0:00:01.028567 and elapsed time 0:00:01.017042, respectively.

    While with this code:

    import random
    from datetime import datetime

    start = datetime.now()

    data_file = open('file.txt', 'w')
    data_lines = ('%i %f\n' % (seq_id, random.random())
                  for seq_id in range(0, 1000000))
    contents = ''.join(data_lines)
    data_file.write(contents)

    end = datetime.now()
    print("elapsed time %s" % (end - start))

    I get elapsed time 0:00:00.722788, which is about 25% faster.

    Notice that data_lines is a generator expression, so the list of lines is never actually stored in memory; the lines are generated and consumed on demand by the join method. This means the only variable that occupies a significant amount of memory is contents. This also slightly reduces the running time.

    If the text is too large to do all the work in memory, you can always split it into chunks: that is, format the string and write it to the file every million lines or so.
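
    A minimal sketch of that chunked variant (the chunk size, total line count, and use of a with block are my own choices for the example, not something prescribed by the question):

    import random

    chunk_size = 1000000        # lines formatted per chunk; tune to the memory you can spare
    total_lines = 100000000

    with open('file.txt', 'w') as data_file:
        for chunk_start in range(0, total_lines, chunk_size):
            # build one big string per chunk instead of one per whole file
            chunk = ''.join('%i %f\n' % (seq_id, random.random())
                            for seq_id in range(chunk_start, chunk_start + chunk_size))
            data_file.write(chunk)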

    Conclusions:

  • Prefer list comprehensions over plain for loops (a list comprehension is even faster than filter for filtering lists; see here).
  • If memory and implementation constraints allow, try to create and write the string contents all at once, using the format or join functions.
  • If possible, and as long as the code remains readable, use built-in functions to avoid explicit for loops; for example, use a list's extend method instead of iterating and calling append (see the short sketch after this list). In fact, both previous points can be seen as instances of this remark.
  • Remark: although this answer can be considered useful on its own, it does not completely address the question, namely why the two-loop version in the question seems to run faster in some environments. For that, perhaps @Aiken Drum's answer below can shed some light on the matter.
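
    As an illustration of the extend point above, here is a small sketch comparing a plain append loop with a single extend call (the names are just placeholders mirroring the question):

    import random

    seq_ids = range(0, 1000000)

    # iterating and calling append once per item...
    data_lines = []
    for seq_id in seq_ids:
        data_lines.append('%i %f\n' % (seq_id, random.random()))

    # ...versus letting extend consume a generator expression in a single call
    data_lines = []
    data_lines.extend('%i %f\n' % (seq_id, random.random()) for seq_id in seq_ids)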


    "A lot" and "far less" are technically very vague terms :) Basically, if you can't measure it, you can't improve it.

    For simplicity let's have a simple benchmark, loop1.py :

    import random
    from datetime import datetime

    start = datetime.now()
    data_file = open('file.txt', 'w')
    for seq_id in range(0, 1000000):
        num_val = random.random()
        line = "%i %f\n" % (seq_id, num_val)
        data_file.write(line)

    end = datetime.now()
    print("elapsed time %s" % (end - start))
    

    loop2.py with 2 for loops:

    import random
    from datetime import datetime

    start = datetime.now()
    data_file = open('file.txt', 'w')
    data_lines = list()
    for seq_id in range(0, 1000000):
        num_val = random.random()
        line = "%i %f\n" % (seq_id, num_val)
        data_lines.append(line)
    for line in data_lines:
        data_file.write(line)

    end = datetime.now()
    print("elapsed time %s" % (end - start))
    

    When I run these two scripts on my computer (with an SSD drive), I get something like:

    $ python3 loop1.py 
    elapsed time 0:00:00.684282
    $ python3 loop2.py 
    elapsed time 0:00:00.766182
    

    Each measurement might differ slightly but, as intuition would suggest, the second one is slightly slower.

    If we want to optimize writing time, we need to check the manual to see how Python implements writing to files. For text files, the open() function should use a BufferedWriter. open() accepts a third argument, the buffer size. Here's the interesting part:

    Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

    Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.

    So, we can modify loop1.py and use line buffering:

    data_file = open('file.txt', 'w', 1)
    

    This turns out to be very slow:

    $ python3 loop3.py 
    elapsed time 0:00:02.470757
    

    To optimize writing time, we can adjust the buffer size to our needs. First we check the line size in bytes with len(line.encode('utf-8')), which gives me 11 bytes.

    After updating the buffer size to our expected line size in bytes:

    data_file = open('file.txt', 'w', 11)
    

    I'm getting quite fast writes:

    elapsed time 0:00:00.669622
    

    Based on the details you've provided, it's hard to say exactly what's going on. Maybe the heuristic for estimating the block size doesn't work well on your computer. Anyway, if you're writing fixed-length lines, it's easy to optimize the buffer size. You could further optimize writing to files by leveraging flush().

    Conclusion: generally, for faster writes to a file you should try to write bulk data that corresponds to a block size on your file system, which is exactly what open('file.txt', 'w') is trying to do. In most cases you're safe with the defaults; the differences in microbenchmarks are insignificant.
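
    As a small illustration of that conclusion, you can inspect Python's default buffer size and, if you want, pass an explicit one; the 4096 below is only an assumption about a common file-system block size:

    import io

    print(io.DEFAULT_BUFFER_SIZE)   # the heuristic default, typically 8192 bytes

    # an explicit buffer roughly matching a 4 KiB file-system block size
    data_file = open('file.txt', 'w', buffering=4096)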

    You're allocating a large number of string objects that need to be collected by the GC. As suggested by @kevmo314, in order to perform a fair comparison you should disable the GC for loop1.py:

    import gc
    gc.disable()
    

    The GC might try to remove string objects while the loop is iterating (since you're not keeping any reference to them), whereas the second approach keeps references to all the string objects and the GC collects them at the end.
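
    A minimal sketch of what that fairer comparison could look like for loop1.py (it simply mirrors the benchmark above with the collector switched off around the timed region):

    import gc
    import random
    from datetime import datetime

    gc.disable()                    # keep the collector out of the measurement
    start = datetime.now()
    data_file = open('file.txt', 'w')
    for seq_id in range(0, 1000000):
        data_file.write("%i %f\n" % (seq_id, random.random()))
    end = datetime.now()
    gc.enable()
    print("elapsed time %s" % (end - start))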


    Below is an extension to the elegant answer by @Tombart and a few further observations.

    With one goal in mind, optimizing the process of producing data in loop(s) and then writing it to a file, let's begin:

    I will use the with statement to open/close the file test.txt in all cases. This statement automatically closes the file once the code block within it has finished executing.

    Another important point to consider is the way Python processes text files across operating systems. From the docs:

    Note : Python doesn't depend on the underlying operating system's notion of text files; all the processing is done by Python itself, and is therefore platform-independent.

    This means that these results should vary only slightly when executed on Linux, macOS, or Windows. Any slight variation may result from other processes using the same file at the same time, multiple I/O operations happening on the file during script execution, general CPU processing speed, and so on.

    I present three cases with execution times for each, and finally a way to further optimize the most efficient and quickest case:

    First case: Loop over range(1,1000000) and write to file

    import time
    import random
    
    start_time = time.time()
    with open('test.txt' ,'w') as f:
        for seq_id in range(1,1000000):
            num_val = random.random()    
            line = "%i %fn" %(seq_id, num_val)
            f.write(line)
    
    print('Execution time: %s seconds' % (time.time() - start_time)) 
    
    #Execution time: 2.6448447704315186 seconds
    

    Note: in the two list scenarios below, I have initialized an empty list data_lines as [] instead of using list(). The reason: [] is about three times faster than list(). Here's an explanation of this behaviour: Why is [] faster than list()?. The main crux of the discussion is that [] is built by a single bytecode instruction, whereas list() is a name that has to be resolved and called as a global function, which involves pushing arguments onto the stack.

    Using the timeit() function in the timeit module, here's the comparison:

    import timeit

    timeit.timeit("[]")        # 0.030497061136874608
    timeit.timeit("list()")    # 0.12418613287039193
    

    Second Case: Loop over range(1,1000000), append values to an empty list and then write to file

    import time
    import random
    
    start_time = time.time()
    data_lines = []
    with open('test.txt' ,'w') as f:
        for seq_id in range(1,1000000):
            num_val = random.random()    
            line = "%i %fn" %(seq_id, num_val)
            data_lines.append(line)
        for line in data_lines:
            f.write(line)
    
    print('Execution time: %s seconds' % (time.time() - start_time)) 
    
    #Execution time: 2.6988046169281006 seconds
    

    Third Case: Loop over a list comprehension and write to file

    With Python's powerful and compact list comprehensions, it is possible to optimize the process further:

    import time
    import random
    
    start_time = time.time()
    
    with open('test.txt', 'w') as f:
        data_lines = ["%i %f\n" % (seq_id, random.random()) for seq_id in range(1, 1000000)]
        for line in data_lines:
            f.write(line)
    
    print('Execution time: %s seconds' % (time.time() - start_time))
    
    #Execution time: 2.464804172515869 seconds
    

    Over multiple iterations, I've always measured a lower execution time in this case compared to the previous two.

    #Iteration 2: Execution time: 2.496004581451416 seconds
    

    Now the question arises: why are list comprehensions (and lists in general) faster than sequential for loops?

    An interesting way to analyze what happens when a sequential for loop executes versus a list comprehension is to disassemble the code object generated by each and examine the contents. Here is an example of a list comprehension code object disassembled:

    #disassemble a list code object
    import dis
    l = "[x for x in range(10)]"
    code_obj = compile(l, '<list>', 'exec')
    print(code_obj)  #<code object <module> at 0x000000058DA45030, file "<list>", line 1>
    dis.dis(code_obj)
    
    #Output:
    <code object <module> at 0x000000058D5D4C90, file "<list>", line 1>
      1           0 LOAD_CONST               0 (<code object <listcomp> at 0x000000058D5D4ED0, file "<list>", line 1>)
                  2 LOAD_CONST               1 ('<listcomp>')
                  4 MAKE_FUNCTION            0
                  6 LOAD_NAME                0 (range)
                  8 LOAD_CONST               2 (10)
                 10 CALL_FUNCTION            1
                 12 GET_ITER
                 14 CALL_FUNCTION            1
                 16 POP_TOP
                 18 LOAD_CONST               3 (None)
                 20 RETURN_VALUE
    

    Here's an example of a for loop code object disassembled inside a function test:

    #disassemble a function code object containing a `for` loop
    import dis
    test_list = []
    def test():
        for x in range(1,10):
            test_list.append(x)
    
    
    code_obj = test.__code__ #get the code object <code object test at 0x000000058DA45420, file "<ipython-input-19-55b41d63256f>", line 4>
    dis.dis(code_obj)
    #Output:
      5           0 SETUP_LOOP              28 (to 30)
                  2 LOAD_GLOBAL              0 (range)
                  4 LOAD_CONST               1 (1)
                  6 LOAD_CONST               2 (10)
                  8 CALL_FUNCTION            2
                 10 GET_ITER
            >>   12 FOR_ITER                14 (to 28)
                 14 STORE_FAST               0 (x)

      6          16 LOAD_GLOBAL              1 (test_list)
                 18 LOAD_ATTR                2 (append)
                 20 LOAD_FAST                0 (x)
                 22 CALL_FUNCTION            1
                 24 POP_TOP
                 26 JUMP_ABSOLUTE           12
            >>   28 POP_BLOCK
            >>   30 LOAD_CONST               0 (None)
                 32 RETURN_VALUE
    

    The above comparison shows more "activity", if I may, in the case of the for loop. For instance, notice the additional attribute lookup and function call for the append() method inside the for loop. To know more about the parameters in the dis output, here's the official documentation.

    Finally, as suggested before, I also tested with file.flush(), and the execution time was in excess of 11 seconds. I added f.flush() before the f.write() statement:

    import os
    .
    .
    .
    for line in data_lines:
            f.flush()                #flushes internal buffer and copies data to OS buffer
            os.fsync(f.fileno())     #the os buffer refers to the file-descriptor(fd=f.fileno()) to write values to disk
            f.write(line)
    

    The longer execution time using flush() can be attributed to the way the data is processed. flush() copies the data from the program's buffer to the operating system's buffer. This means that if a file (say test.txt in this case) is being used by multiple processes and large chunks of data are being added to it, you will not have to wait for all of the data to be written before the information becomes available. But to make sure that the buffered data is actually written to disk, you also need to add os.fsync(f.fileno()). Adding os.fsync() increases the execution time at least 10 times (I didn't sit through the whole run!), as it involves copying the data from the buffers to hard-disk storage. For more details, go here.
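
    For reference, here is a self-contained sketch of that measurement, with flush() and fsync() called after each write (expect it to be dramatically slower than the buffered versions, which is the whole point):

    import os
    import time
    import random

    start_time = time.time()
    data_lines = ["%i %f\n" % (seq_id, random.random()) for seq_id in range(1, 1000000)]
    with open('test.txt', 'w') as f:
        for line in data_lines:
            f.write(line)
            f.flush()               # push Python's internal buffer to the OS buffer
            os.fsync(f.fileno())    # force the OS buffer onto the disk

    print('Execution time: %s seconds' % (time.time() - start_time))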

    Further optimization: it is possible to optimize the process further. There are libraries available that support multithreading, create process pools, and perform asynchronous tasks. This is particularly useful when a function performs a CPU-intensive task and writes to a file at the same time. For instance, a combination of threading and list comprehensions gives the fastest result(s):

    import time
    import random
    import threading
    
    start_time = time.time()
    
    def get_seq():
        data_lines = ["%i %fn" %(seq_id, random.random()) for seq_id in range(1,1000000)]
        with open('test.txt' ,'w') as f: 
            for line in data_lines:
                f.write(line)
    
    set_thread = threading.Thread(target=get_seq)
    set_thread.start()
    
    print('Execution time: %s seconds' % (time.time() - start_time))
    
    #Execution time: 0.015599966049194336 seconds
    

    Conclusion: list comprehensions offer better performance than sequential for loops with list appends. The primary reason is the specialized, single-pass bytecode used to build the list in a comprehension, which is faster than the repeated iterative calls to append as in a for loop. There is scope for further optimization using asyncio, threading and ProcessPoolExecutor(); you could also use a combination of these to achieve faster results (a sketch follows below). Using file.flush() depends on your requirements: you may add it when you need data to become visible to other processes while the file is still being written, although this may take much longer if you also force the data from the OS buffer to disk with os.fsync(f.fileno()).
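
    As one possible sketch of the ProcessPoolExecutor() route mentioned above (the chunk size, and the idea of formatting lines in worker processes while the main process writes them out in order, are my own assumptions for illustration, not part of the original measurements):

    import random
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 250000                     # arbitrary number of lines formatted per worker task

    def format_chunk(start):
        # CPU-bound string formatting runs in a worker process
        return ''.join('%i %f\n' % (seq_id, random.random())
                       for seq_id in range(start, start + CHUNK))

    if __name__ == '__main__':
        with ProcessPoolExecutor() as pool, open('test.txt', 'w') as f:
            # map() yields results in submission order, so the lines stay in sequence
            for chunk in pool.map(format_chunk, range(1, 1000000, CHUNK)):
                f.write(chunk)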
