How can I make my Python script faster?

2018-06-07 19:11:12

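I am very new to Python and wrote a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq file. A fastq file stores information in chunks of four lines each. The first line of each chunk starts with the character "@". The fastq file I use as input is 36 GB and contains about 14 million lines.

I tried to rewrite an existing script that used far too much memory, and I managed to reduce the memory usage a lot. But the script takes forever to run, and I can't see why.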

import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type = str, help = "The name of the fastq input file.", default = sys.stdin)
parser.add_argument("outputfile", type = str, help = "Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()


def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)

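The grep step takes almost no time at all, but after more than 500 minutes the script still has not started writing to the output file. So I suppose one of the steps between grep and the last for loop is what takes so long. But I don't understand which step, or what I can do to speed it up.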

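Depending on the size of linesamples, the test "if i in linesamples" will take a long time, because you are searching a list once for every line you read from infile. Each of those searches scans the list element by element. You can convert linesamples to a set to improve the lookup time. I have also replaced enumerate with a line_num counter that is incremented on each iteration, though that change matters far less than the switch to a set.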

def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a set of the line numbers that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(4*i+0)
        linesamples.add(4*i+1)
        linesamples.add(4*i+2)
        linesamples.add(4*i+3)
    # fetch lines from input file and write them to output file.
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    infile.close()
    outputfile.close()
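
For comparison, here is a minimal Python 3 sketch of the same set-based idea that tracks whole four-line records instead of individual line numbers. The function name sample_records and its seed parameter are made up for illustration. It assumes every record occupies exactly four lines, and it counts records by counting lines, because "@" can also appear in quality strings, so grep -c "@" may overcount.

```python
import random


def sample_records(in_path, out_path, n, seed=None):
    """Copy n randomly chosen 4-line records from in_path to out_path."""
    rng = random.Random(seed)

    # First pass: count lines (assumes exactly four lines per record).
    with open(in_path, "r") as handle:
        total_lines = sum(1 for _ in handle)
    total_records = total_lines // 4

    # A set gives constant-time membership tests on average.
    chosen = set(rng.sample(range(total_records), n))

    # Second pass: stream the file and keep the lines of chosen records.
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line_num, line in enumerate(src):
            if line_num // 4 in chosen:
                dst.write(line)
    return chosen
```

Something like sample_records("reads.fastq", "subset.fastq", 1000) would then write 1000 records in their original order. The file is read twice, but memory use stays proportional to n rather than to the file size.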

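You said grep runs very fast, so instead of using grep only to count the occurrences of @, have grep output the byte offset of each matching line (use grep's -b option). Then use random.sample to choose the chunks you want. Once you have chosen the byte offsets you want, use infile.seek to jump to each offset and print out 4 lines from there.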
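
A rough sketch of this seek-based approach in Python 3 follows. To keep it self-contained it collects the byte offsets with a plain Python scan instead of grep -b, which is a substitution on my part. The helper names record_offsets and sample_by_offset are hypothetical, and the code assumes every record is exactly four lines long.

```python
import random


def record_offsets(path):
    """Return the byte offset of the first line of every 4-line record."""
    offsets = []
    with open(path, "rb") as handle:
        line_num = 0
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line_num % 4 == 0:
                offsets.append(pos)
            line_num += 1
    return offsets


def sample_by_offset(in_path, out_path, n, seed=None):
    """Write n randomly chosen records by seeking to their byte offsets."""
    rng = random.Random(seed)
    offsets = record_offsets(in_path)
    # Sorting keeps the seeks moving forward through the file.
    chosen = sorted(rng.sample(offsets, n))
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for offset in chosen:
            src.seek(offset)
            for _ in range(4):
                dst.write(src.readline())
    return chosen
```

The offset list holds one integer per record, so it stays small compared with the file itself. After the initial scan, only the sampled records are read again.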

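Try to parallelize your code. Here is what I mean. You have 14,000,000 lines of input.

1. Run your grep first, filter your lines, and write them to filteredInput.txt.
2. Split filteredInput.txt into files of 10,000 to 100,000 lines each, for example filteredInput001.txt, filteredInput002.txt.
3. Run your code on these split files. Write the output to separate files such as output001.txt, output002.txt.
4. As the last step, merge your results.

Since your code currently never finishes, you can also run it on these filtered inputs. Your code can check whether the filteredInput files exist, work out which step it had reached, and resume from that step.

You can also use multiple Python processes (after step 1), started from the shell, or Python threads.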
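
The split, process, and merge pattern can be sketched in Python 3 as below. This is only an illustration under assumptions: the chunks are held in memory rather than written to filteredInput files, process_chunk is a hypothetical placeholder for the real per-chunk work (here it just returns the header line of each record), and chunk_size must be a multiple of 4 so that no record is split across chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Split a list of lines into consecutive chunks of chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def process_chunk(chunk):
    """Hypothetical per-chunk work: return the header line of each record."""
    return chunk[::4]


def run_parallel(lines, chunk_size, workers=4):
    """Process chunks concurrently and merge the results in input order."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        results = list(pool.map(process_chunk, chunks))
    return [line for part in results for line in part]
```

In CPython, threads do not speed up CPU-bound work because of the global interpreter lock. ProcessPoolExecutor offers the same map interface if separate processes are needed. Whether either helps here depends on how much of the run time is spent on disk I/O.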
