Delimit a specific column and add them as columns in CSV (Python3, CSV)
I have a csv file that has several columns that I first delimit by colon (;). However, ONE column is delimited by a pipe | and I would like to delimit this column and create new columns.
Input:
Column 1 Column 2 Column 3
1 2 3|4|5
6 7 6|7|8
10 11 12|13|14
Desired Output:
Column 1 Column 2 ID Age Height
1 2 3 4 5
6 7 6 7 8
10 11 12 13 14
My code so far delimits the first time by ; and then converts to DF (which is my desired end format)
delimit = list(csv.reader(open('test.csv', 'rt'), delimiter=';'))
df = pd.DataFrame(delimit)
You didn't show exactly what the data looks like (you say it's delimited by semicolons, but your examples don't have any), but if it looks like
Column 1;Column 2;Column 3
1;2;3|4|5
6;7;6|7|8
10;11;12|13|14
You could do something like
>>> df = pd.read_csv("test.csv", sep="[;|]", engine='python', skiprows=1,
names=["Column 1", "Column 2", "ID", "Age", "Height"])
>>> df
Column 1 Column 2 ID Age Height
0 1 2 3 4 5
1 6 7 6 7 8
2 10 11 12 13 14
This works by using a regex separator meaning "either ;
or |
" and forcing the column names manually.
Alternatively, you could do it in a few steps:
>>> df = pd.read_csv("test.csv", sep=";")
>>> df
Column 1 Column 2 Column 3
0 1 2 3|4|5
1 6 7 6|7|8
2 10 11 12|13|14
>>> c3 = df.pop("Column 3").str.split("|", expand=True)
>>> c3.columns = ["ID", "Age", "Height"]
>>> df.join(c3)
Column 1 Column 2 ID Age Height
0 1 2 3 4 5
1 6 7 6 7 8
2 10 11 12 13 14
delimit = list(csv.reader(open('test.csv', 'rt'), delimiter=';'))
for row in delimit:
piped = row.pop()
row.extend(piped.split('|'))
df = pd.DataFrame(delimit)
delimit
最终看起来像:
[
['1', '2', '3', '4', '5'],
['6', '7', '6', '7', '8'],
['10', '11', '12', '13', '14'],
]
It is actually much faster to use the csv lib and str.replace:
import csv
with open("test.txt") as f:
next(f)
# itertools.imap python2
df = pd.DataFrame.from_records(csv.reader(map(lambda x: x.rstrip().replace("|", ";"), f), delimiter=";"),
columns=["Column 1", "Column 2", "ID", "Age", "Height"]).astype(int)
Some timings:
In [35]: %%timeit
pd.read_csv("test.txt", sep="[;|]", engine='python', skiprows=1,
names=["Column 1", "Column 2", "ID", "Age", "Height"])
....:
100 loops, best of 3: 14.7 ms per loop
In [36]: %%timeit
with open("test.txt") as f:
next(f)
df = pd.DataFrame.from_records(csv.reader(map(lambda x: x.rstrip().replace("|", ";"), f),delimiter=";"),
columns=["Column 1", "Column 2", "ID", "Age", "Height"]).astype(int)
....:
100 loops, best of 3: 6.05 ms per loop
You could just str.split:
with open("test.txt") as f:
next(f)
df = pd.DataFrame.from_records(map(lambda x: x.rstrip().replace("|", ";").split(";"), f),
columns=["Column 1", "Column 2", "ID", "Age", "Height"])
链接地址: http://www.djcxy.com/p/30322.html
上一篇: 使用AWS Lambda的OpenCV