Duplicate File Finder using defaultdict

I'm experimenting with different ways to identify duplicate files, based on file content, by looping through the top-level directory where folders A-Z exist. Within folders A-Z there is one additional layer of folders named after the current date. Finally, within the dated folders there are anywhere from several thousand to several million (<3 million) files in various formats.

Using the script below I was able to process roughly 800,000 files in about 4 hours. However, when running it over a larger data set of roughly 13,000,000 files in total, it consistently breaks on the letter "I" folder, which contains roughly 1.5 million files.

Given the size of the data I'm dealing with, I'm considering writing the records directly to a text file and then importing them into MySQL or something similar for further processing. Please let me know if I'm going down the right track or whether a modified version of the script below should be able to handle 13+ million files.
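
To make the idea concrete, this is roughly what I have in mind for the dump step (the root path and output file name below are just placeholders); the deduplication itself would then happen in SQL rather than in Python dictionaries:

import os

root = "\\path\\to\\files"  # placeholder for the top-level directory
with open("sizes_dump.txt", "w", encoding="utf8") as out:
    for subdir, dirs, files in os.walk(root):
        for name in files:
            filePath = os.path.join(subdir, name)
            try:
                size = os.stat(filePath).st_size
            except OSError:
                continue  # skip files that disappeared or can't be read
            # one tab-separated record per file; nothing accumulates in memory
            out.write(str(size) + "\t" + filePath + "\n")
# The dump could then be loaded with LOAD DATA INFILE and candidate
# duplicates pulled out with GROUP BY size HAVING COUNT(*) > 1.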

Question - How can I modify the script below to handle 13+ million files?

Error traceback:

Traceback (most recent call last):
  File "C:/Users/"user"/PycharmProjects/untitled/dups.py", line 28, in <module>
    for subdir, dirs, files in os.walk(path):
  File "C:Python34libos.py", line 379, in walk
    yield from walk(new_path, topdown, onerror, followlinks)
  File "C:Python34libos.py", line 372, in walk
    nondirs.append(name)
MemoryError

My code:

import hashlib
import os
import datetime
from collections import defaultdict


def hash(filepath):
    # Hash the file in 64 KB blocks so large files never sit fully in memory
    hash = hashlib.md5()
    blockSize = 65536
    with open(filepath, 'rb') as fpath:
        block = fpath.read(blockSize)
        while len(block) > 0:
            hash.update(block)
            block = fpath.read(blockSize)
    return hash.hexdigest()


directory = "\\path\\to\\files"
directories = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]
outFile = open("path\\output.txt", "w", encoding='utf8')

for folder in directories:
    sizeList = defaultdict(list)
    path = os.path.join(directory, folder)
    print("Start time: " + str(datetime.datetime.now()))
    print("Working on folder: " + folder)
    # Walk through one level of directories
    for subdir, dirs, files in os.walk(path):
        for file in files:
            filePath = os.path.join(subdir, file)
            sizeList[os.stat(filePath).st_size].append(filePath)
    print("Hashing " + str(len(sizeList)) + " Files")
    ## Hash remaining files
    fileList = defaultdict(list)
    for fileSize in sizeList.values():
        if len(fileSize) > 1:
            for dupSize in fileSize:
                fileList[hash(dupSize)].append(dupSize)
    ## Write remaining hashed files to file
    print("Writing Output")
    for fileHash in fileList.values():
        if len(fileHash) > 1:
            for hashOut in fileHash:
                outFile.write(hashOut + " ~ " + str(os.stat(hashOut).st_size) + '\n')
            outFile.write('\n')
outFile.close()
print("End time: " + str(datetime.datetime.now()))

Disclaimer: I don't know if this is a solution.

I looked at your code, and the error is raised inside os.walk. It's true that this might just be too much information being processed at once (so maybe an external DB would help matters, though the added round trips might hinder your speed). But other than that, os.listdir (which os.walk calls) performs really badly when a directory holds a huge number of files. Hopefully this is resolved in Python 3.5, because it implements the way better scandir, so if you're willing* to try the latest release (and I do mean latest, it was released, what, 8 days ago?), that might help.
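
To give an idea of what I mean, here is a rough sketch of a scandir-based walk that yields files one at a time instead of building the big per-directory name lists that os.walk keeps around (the import fallback assumes the scandir backport package on older Pythons):

try:
    from os import scandir          # Python 3.5+
except ImportError:
    from scandir import scandir     # backport package installable with pip


def iter_files(root):
    # Yield (path, size) for every regular file under root, one entry at a time.
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in scandir(current):
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                # On Windows the size is usually already cached on the entry,
                # so this avoids a second stat() call per file.
                yield entry.path, entry.stat().st_size

Your sizeList loop could then consume iter_files(path) in place of the os.walk block.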

Other than that, you can try tracing the bottlenecks and forcing garbage collection between folders to figure out where the memory is going.
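
For the garbage-collection angle, a crude check is to clear the per-folder dicts and force a collection between letters; the hook below is hypothetical, you'd call it at the end of each folder iteration, and the print is only there to show where the time goes:

import gc
import datetime


def end_of_folder(folder, sizeList, fileList):
    # Drop the per-folder data explicitly, then force a full collection and
    # report what the collector found and when the folder finished.
    sizeList.clear()
    fileList.clear()
    unreachable = gc.collect()
    print(folder, "finished at", datetime.datetime.now(),
          "- gc found", unreachable, "unreachable objects")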

*You can also just install the scandir backport with pip under your current Python, but where's the fun in that?
