How do quickly search through a .csv file in Python

2018-06-19 13:35:46

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.

Are there any tricks to search the entire file? Should you read the whole thing into a dictionary or should you perform a search every time? I tried loading it into a dictionary but that took ages so I'm currently searching through the whole file every time which seems wasteful.

Could I possibly utilize that the list is alphabetically ordered? (eg if the search word starts with "b" I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b")

I'm using import csv .

(a side question: it is possible to make csv go to a specific line in the file? I want to make the program start at a random line)

Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?

If the csv file isn't changing, load in it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that though.

Here is a rough example of inserting from a csv into a sqlite table. Example csv is ';' delimited, and has 2 columns.

import csv
import sqlite3

con = sqlite3.Connection('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')

f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')

cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)
cur.close()
con.commit()
con.close()
f.close()

你可以将内存映射用于真正的大文件

import mmap,os,re
reportFile = open( "big_file" )
length = os.fstat( reportFile.fileno() ).st_size
try:
    mapping = mmap.mmap( reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
    mapping = mmap.mmap( reportFile.fileno(), 0, None, mmap.ACCESS_READ )
data = mapping.read(length)
pat =re.compile("b.+",re.M|re.DOTALL) # compile your pattern here.
print pat.findall(data)

Well, if your words aren't too big (meaning they'll fit in memory), then here is a simple way to do this (I'm assuming that they are all words).

from bisect import bisect_left

f = open('myfile.csv')

words = []
for line in f:
    words.extend(line.strip().split(','))

wordtofind = 'bacon'
ind = bisect_left(words,wordtofind)
if words[ind] == wordtofind:
    print '%s was found!' % wordtofind

It might take a minute to load in all of the values from the file. This uses binary search to find your words. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values you also might want to use bisect_right to find the the index of 1 beyond the rightmost element that equals the value you are searching for. You can still use this if you have key:value pairs. You'll just have to make each object in your words list be a list of [key, value].

Side Note

I don't think that you can really go from line to line in a csv file very easily. You see, these files are basically just long strings with n characters that indicate new lines.

链接地址: http://www.djcxy.com/p/55086.html

上一篇: Python CSV编写器修剪前导零

下一篇: 如何快速搜索Python中的.csv文件