Merging two strings that overlap

2018-07-01 23:51:14

I am trying to build a full address, but the data I have comes in this form:

Line 1                     | Line 2                   | Postcode
1, First Street, City, X13
1, First Street             First Street, City          X13 
1                           1, First Street, City, X13  X13

There are a few other permutations of how this data is created, but I want to be able to merge all this into one string where there is no overlap. So I want to create the string:
1, First Street, City, X13

But not 1, First Street, First Street, City, X13 etc.

How can I concatenate or merge these without duplicating data that is already there? Some rows, like the top one, have no information past the first cell.

If you have plain text, you can split it on '\n' to get the lines, then split each line on ',' to get the separate fields:

>>> s = """1, First Street, City, X13
... 1, First Street             First Street, City,          X13 
... 1                           1, First Street, City, X13  X13"""
>>> 
>>> lines = s.split('\n')
>>> 
>>> splitted_lines = [line.split(',') for line in lines]

Note that a more Pythonic way is to read the text with the csv module, specifying the comma as the delimiter.

import csv
with open('file_name') as f:
    # csv.reader is lazy, so build the list before the file is closed
    splitted_lines = list(csv.reader(f, delimiter=','))

Then you can use the following list comprehension to get the unique fields in each column:

>>> import re
>>> ' '.join([set([set(re.split(r'\s{2,}',i)).pop() for i in column]).pop() for column in zip(*splitted_lines)])
'1  First Street  City'

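Here zip() transposes the rows into columns, and re.split() with the regex r'\s{2,}' splits each item on runs of two or more whitespace characters. You can then use set() to keep only the unique items. Be aware that set.pop() returns an arbitrary element, so which piece is picked for a column (and therefore the exact output) is not guaranteed.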
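To make the one-liner easier to follow, here is a minimal sketch of the intermediate values for the first two sample rows. The names rows, columns and pieces are only illustrative, and the cell is stripped before splitting, which the original expression does not do.

```python
import re

# The first two sample lines, already split on ','.
rows = [
    ["1", " First Street", " City", " X13"],
    ["1", " First Street             First Street", " City", "          X13 "],
]

# zip(*rows) transposes the rows, so each tuple holds one column.
columns = list(zip(*rows))
# columns[1] is (' First Street', ' First Street             First Street')

# Split one cell on runs of two or more whitespace characters.
pieces = re.split(r"\s{2,}", columns[1][1].strip())
# pieces is ['First Street', 'First Street']

# A set collapses the duplicates.
unique = set(pieces)
# unique is {'First Street'}
```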

Note: if you care about the order, you can use collections.OrderedDict instead of set:

>>> from collections import OrderedDict
>>> 
>>> d = OrderedDict()
>>> ' '.join([list(d.fromkeys([set(re.split(r'\s{2,}',i)).pop() for i in column]))[0] for column in zip(*splitted_lines)])
'1  First Street  City  X13'
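On Python 3.7 and later a plain dict also preserves insertion order, so the same idea can be spelled out without OrderedDict. The following is only a sketch under a few assumptions: merge_columns is a made-up name, every line is assumed to contain the same number of commas (as in the sample string s above, since zip() stops at the shortest row), and empty pieces are discarded.

```python
import re

def merge_columns(lines):
    """Return one de-duplicated field per comma-separated column."""
    rows = [line.split(",") for line in lines]
    merged = []
    for column in zip(*rows):
        pieces = []
        for cell in column:
            # Split on runs of 2+ whitespace and ignore empty pieces.
            pieces.extend(p for p in re.split(r"\s{2,}", cell.strip()) if p)
        # dict.fromkeys keeps the first occurrence of each piece, in order.
        ordered = list(dict.fromkeys(pieces))
        merged.append(ordered[0] if ordered else "")
    return ", ".join(merged)

s = """1, First Street, City, X13
1, First Street             First Street, City,          X13 
1                           1, First Street, City, X13  X13"""

print(merge_columns(s.split("\n")))  # 1, First Street, City, X13
```

Like the one-liner, this keeps only the first distinct piece found in each column, so it relies on the columns lining up.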

If you don't mind losing punctuation:

from collections import OrderedDict
from string import punctuation

with open("test.txt") as f:
    next(f)  # skip the header row
    # Strip punctuation from each word and keep only its first occurrence
    words = OrderedDict.fromkeys(word.strip(punctuation) for line in f
                                 for word in line.split())
    print(" ".join(words))

1 First Street City X13

If the address contains repeated words, this approach won't work. Based on your input, though, there is no way to know which combinations are possible, unless the second line is always intact, in which case you would just need to pull the second line.
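If the three cells are available separately, another option is to merge them at the seams: drop the longest run of comma-separated fields that ends one cell and also starts the next. This is not taken from the answers above; it is a rough sketch, the function names are made up, and it only catches overlap where two cells meet, not duplicates buried in the middle.

```python
def split_fields(text):
    """Split a cell on commas, dropping empty fields and extra spaces."""
    return [field.strip() for field in text.split(",") if field.strip()]

def merge_overlap(left, right):
    """Append right to left, skipping fields that end left and start right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return left + right  # no overlap found

def full_address(*cells):
    """Merge the cells left to right into one comma-separated string."""
    merged = []
    for cell in cells:
        merged = merge_overlap(merged, split_fields(cell))
    return ", ".join(merged)

print(full_address("1, First Street, City, X13", "", ""))
print(full_address("1, First Street", "First Street, City", "X13"))
print(full_address("1", "1, First Street, City, X13", "X13"))
# each call prints: 1, First Street, City, X13
```

Comparing whole fields rather than characters avoids accidental matches such as the "1" in "X1" overlapping a house number.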
