How do I quickly search through a .csv file in Python?
I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.
Are there any tricks to searching the entire file? Should I read the whole thing into a dictionary, or should I scan the file on every search? I tried loading it into a dictionary, but that took ages, so I'm currently searching through the whole file every time, which seems wasteful.
Could I take advantage of the fact that the list is alphabetically ordered? (e.g. if the search word starts with "b", I only search from the line containing the first word beginning with "b" to the line containing the last word beginning with "b".)
I'm using the csv module (import csv).
(A side question: is it possible to make csv jump to a specific line in the file? I want to make the program start at a random line.)
Edit: I already have a copy of the list as an .sql file as well. How could I use that from Python?
If the csv file isn't changing, load it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that, though.
Here is a rough example of inserting from a csv into a sqlite table. The example csv is ';'-delimited and has 2 columns.
import csv
import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')

f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')

# executemany pulls rows straight from the reader, one INSERT per row
cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)

con.commit()
cur.close()
con.close()
f.close()
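Once the rows are in sqlite, searching is just a parameterized SELECT, and an index on the column you search makes lookups fast even with 6 million rows. A rough sketch using the table and column names from the example above (the index name idx_stuff_one and the search value 'bacon' are made up for illustration):

import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()

# One-time step: index the column you search on, so lookups don't scan every row.
cur.execute('CREATE INDEX IF NOT EXISTS idx_stuff_one ON stuff (one);')
con.commit()

# Parameterized lookup of a single entry ('bacon' is just an example value).
cur.execute('SELECT one, two FROM stuff WHERE one = ?', ('bacon',))
print(cur.fetchall())

con.close()

If you already have the data as an .sql dump (per the edit in the question), and assuming it is a plain text file of SQL statements, sqlite3 can run it in one go with cur.executescript(open('dump.sql').read()).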
You could use memory mapping for really big files.
import mmap, os, re

reportFile = open("big_file", "rb")  # binary mode is safest for mmap
length = os.fstat(reportFile.fileno()).st_size
try:
    # Unix: read-only private mapping
    mapping = mmap.mmap(reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ)
except AttributeError:
    # Windows: MAP_PRIVATE/PROT_READ don't exist there, so fall back to ACCESS_READ
    mapping = mmap.mmap(reportFile.fileno(), 0, None, mmap.ACCESS_READ)
data = mapping.read(length)
pat = re.compile(b"b.+", re.M | re.DOTALL)  # compile your pattern here.
print(pat.findall(data))
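If you are only after one known entry rather than a pattern, the mmap object's find() method returns the byte offset of the first occurrence, so you don't have to pull the whole mapping into a string first. A minimal sketch along the same lines as above (the file name and the search term "bacon" are placeholders):

import mmap, os

f = open("big_file", "rb")
length = os.fstat(f.fileno()).st_size
try:
    m = mmap.mmap(f.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ)
except AttributeError:
    m = mmap.mmap(f.fileno(), 0, None, mmap.ACCESS_READ)

# find() returns the byte offset of the first match, or -1 if it is absent
offset = m.find(b"bacon")
print(offset)

m.close()
f.close()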
Well, if the list of words isn't too big (meaning it will fit in memory), then here is a simple way to do this (I'm assuming the entries are all single words).
from bisect import bisect_left

f = open('myfile.csv')
words = []
for line in f:
    words.extend(line.strip().split(','))
# bisect requires a sorted list; the file is alphabetically ordered,
# so the flattened list is too
wordtofind = 'bacon'
ind = bisect_left(words, wordtofind)
if ind < len(words) and words[ind] == wordtofind:
    print('%s was found!' % wordtofind)
It might take a minute to load all of the values from the file. This uses binary search to find your words. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values, you might also want to use bisect_right to find the index one past the rightmost element that equals the value you are searching for. You can still use this if you have key:value pairs; just make each object in your words list a list of [key, value].
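For that key:value case, one way to keep bisect usable is to store the pairs as [key, value] lists sorted by key and probe with a one-element list, since Python compares lists element by element and a shorter prefix sorts first. A small sketch with made-up data:

from bisect import bisect_left

# rows sorted by key, e.g. built from csv.reader over an alphabetically ordered file
rows = [['apple', '3'], ['bacon', '12'], ['cheese', '7']]

key = 'bacon'
# [key] sorts just before any [key, value], so bisect_left lands on the first match
ind = bisect_left(rows, [key])
if ind < len(rows) and rows[ind][0] == key:
    print('%s -> %s' % (key, rows[ind][1]))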
Side Note
I don't think you can easily jump to a specific line in a csv file. These files are basically just long strings with \n characters that indicate new lines, so there is no way to reach line N without reading past the N-1 lines before it.
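For the "start at a random line" side question, a common workaround is to seek to a random byte offset, throw away the partial line you land in, and read the next complete line. A rough sketch, assuming a non-empty file whose name is a placeholder:

import os, random

f = open('myfile.csv', 'rb')      # binary mode so seek() accepts any offset
size = os.path.getsize('myfile.csv')

f.seek(random.randrange(size))    # jump somewhere in the middle of the file
f.readline()                      # discard the (probably partial) line we landed in
line = f.readline()               # a complete line, or empty if we hit end of file
print(line)
f.close()

From there you can split the line on the delimiter yourself, or seek back to 0 and retry if you happened to land in the last line.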