"big" Data csv search from 2 files

I have a computational problem.

I'm using Python to iterate over 2 CSV files.

CSV file 1 contains 6-7 columns; the important one is an "rs ID" column from dbSNP.

CSV file 2 has 3 columns, 2 of which matter: the rs ID and a GENE symbol column.

My problem:

Now I want to search: is a given rs ID from CSV file 1 in CSV file 2? If yes, take the gene symbol from CSV file 2 and put it into CSV file 1 at the row where the match occurred (position "x", e.g. row 4512451).

CSV file 1 = 1.3 GB, CSV file 2 = 8.8 MB.

I'm generating a dictionary in Python from CSV file 2 and using it to search CSV file 1.

Problem: for every row (rs ID) in CSV file 1, it iterates through the whole dictionary (the 8.8 MB file).

That takes way too much time... Do you know another approach to make this search faster? I thought a dictionary/hashtable would be good, but it is far too slow.
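A simplified sketch of what my loop currently looks like (the file names and column positions here are placeholders for my real ones):

```python
import csv

# Build a dictionary from the small file: rs ID -> gene symbol.
# Column positions (0 = rs ID, 1 = gene) are placeholders.
with open("file2.csv", newline="") as f2:
    rs_to_gene = {row[0]: row[1] for row in csv.reader(f2)}

# For every row of the big file, scan the whole dictionary --
# this inner loop over .items() is the bottleneck.
with open("file1.csv", newline="") as f1:
    for row in csv.reader(f1):
        for rs, gene in rs_to_gene.items():
            if row[0] == rs:
                row.append(gene)
                break
```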

Maybe creating a suffix array from CSV file 2 instead of using a dictionary?

Or are there packages or other data structures in Python (vectorization methods) that would help?

I would be very grateful for your help!


Have you tried reading both CSV files into memory? 1.3 GB still seems manageable. Then you can put both CSV files into data structures that are much better suited to your problem.

If you choose to do that, I would recommend using pandas DataFrames as the container. They can be constructed directly from CSV files, and your search can then be performed quite fast using isin.
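A minimal sketch of what that could look like (the file names and the column names "rsID" and "gene" are assumptions; adjust them to your actual headers):

```python
import pandas as pd

# Both files fit in memory; read_csv builds a DataFrame directly.
df1 = pd.read_csv("file1.csv")  # the ~1.3 GB file
df2 = pd.read_csv("file2.csv")  # the ~8.8 MB file

# Vectorized membership test: which rs IDs from file 1 occur in file 2?
mask = df1["rsID"].isin(df2["rsID"])
print(f"{mask.sum()} rows have a match")

# To actually carry the gene symbol over, map through a Series
# indexed by rs ID; rows without a match get NaN.
df1["gene"] = df1["rsID"].map(df2.set_index("rsID")["gene"])
df1.to_csv("file1_annotated.csv", index=False)
```

The map through an rs-ID-indexed Series does one hash lookup per row instead of a scan; pd.merge(df1, df2, on="rsID", how="left") would give the same result as a join. (Incidentally, your original dictionary would also be fast if you indexed it directly with rs_to_gene.get(rs) rather than looping over its items.)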
