SQL address data is messy, how to clean it up in a query?
I have address data stored in a SQL Server 2000 database, and I need to pull out all the addresses for a given customer code. The problem is that there are a lot of misspelled addresses, some with missing parts, and so on, so I need to clean this up somehow. I need to weed out the bad spellings and missing parts and come up with the "average" record. For example, if New York is spelled properly in 4 out of 5 records, that should be the value returned.
I can't modify the data, validate it on input, or anything like that. I can only modify a copy of the data, or manipulate it through a query.
I got a partial answer in "Addresses stored in SQL server have many small variations (errors)", but I need to allow for multiple valid addresses per code.
Sample Data
Code   Name                    Address1                    Address2     City         State       Zip    TimesUsed
10003  AMERICAN NUTRITON INC   2183 BALL STREET                         OLDEN        Utah        87401  177
10003  AMEICAN NUTRITION INC   2183 BALL STREET            PO BOX 1504  OLDEN        Utah        87402  76
10003  AMERICAN NUTRITION INC  2183 BALL STREET                         OLDEN        Utah        87402  24
10003  AMERICAN NUTRITION INC  2183 BALL STREET            PO BOX 1504  OLDEN        Utah        87402  17
10003  Samantha Brooks         506 S. Main Street                       Ellensburg   Washington  98296  1
10003  BEMIS COMPANY           1401 W. FOURTH PLAIN BLVD.               VANCOUVER    Washington  98660  1
10003  CEI                     597 VANDYRE BOULEVARD                    WRIGHTSTOWN  Wisconsin   54180  1
10003  Pacific Pet             28th Avenue                              OLDEN        Utah        84401  1
10003  PETSMART, INC.          16091 NORTH 25TH STREET                  PHOENA       Arizona     85027  1
10003  THE PET FIRM            16418 NORTH 37TH STREET                  PHOENA       Arizona     85503  1
Desired Output
Code   Name                    Address1                    Address2     City         State       Zip
10003  AMERICAN NUTRITION INC  2183 BALL AVENUE                         Olden        Utah        84401
10003  Samantha Brooks         506 S. Main Street                       Ellensburg   Washington  98296
10003  BEMIS COMPANY           1401 W. FOURTH PLAIN BLVD.               VANCOUVER    Washington  98660
10003  CEI                     975 VANDYKE ROAD                         WRIGHTSTOWN  Wisconsin   54180
10003  Pacific Pet             29th Street                              OGDEN        Utah        84401
10003  PETSMART, INC.          16091 NORTH 25TH AVENUE                  PHOENA       Arizona     85027
10003  THE PET FIRM            16418 NORTH 37TH STREET                  PHOENA       Arizona     85503
The best solution is to use a CASS-certified address standardization program or service that will format and validate the address. Beyond the USPS, which has tools for this, there are many third-party programs and services that provide this functionality. Address parsing is far more complicated than you might imagine, so trying to whip up a few queries to do it will be fraught with peril.
Google's Geocoding is another place to look. However, Google apparently requires you to display the results in order to use their Geocoding service. That leaves using dedicated address parsers like the USPS or a third-party program.
Using group by soundex(Name) you will get a result like the one shown after the query. You have to test on your own data to find out whether this helps in your situation. I cannot test this on SQL Server 2000, so I am not sure whether soundex is available there.
-- Sample rows from the question (Name and Address1 only).
-- Note: the multi-row VALUES list below requires SQL Server 2008 or later;
-- on SQL Server 2000 each row needs its own INSERT statement.
declare @T table (Code char(5), Name varchar(50), Address1 varchar(50))

insert into @T values
    ('10003', 'AMERICAN NUTRITON INC', '2183 BALL STREET'),
    ('10003', 'AMEICAN NUTRITION INC', '2183 BALL STREET'),
    ('10003', 'AMERICAN NUTRITION INC', '2183 BALL STREET'),
    ('10003', 'AMERICAN NUTRITION INC', '2183 BALL STREET'),
    ('10003', 'Samantha Brooks', '506 S. Main Street'),
    ('10003', 'BEMIS COMPANY', '1401 W. FOURTH PLAIN BLVD.'),
    ('10003', 'CEI', '597 VANDYRE BOULEVARD'),
    ('10003', 'Pacific Pet', '28th Avenue'),
    ('10003', 'PETSMART, INC.', '16091 NORTH 25TH STREET'),
    ('10003', 'THE PET FIRM', '16418 NORTH 37TH STREET')

-- One row per soundex code; min() returns the alphabetically first value in each group.
select
    min(Code) as Code,
    min(Name) as Name,
    min(Address1) as Address1
from @T
group by soundex(Name)
________________________________________________________
Code   Name                    Address1
10003  AMEICAN NUTRITION INC   2183 BALL STREET
10003  AMERICAN NUTRITION INC  2183 BALL STREET
10003  BEMIS COMPANY           1401 W. FOURTH PLAIN BLVD.
10003  CEI                     597 VANDYRE BOULEVARD
10003  Pacific Pet             28th Avenue
10003  PETSMART, INC.          16091 NORTH 25TH STREET
10003  Samantha Brooks         506 S. Main Street
10003  THE PET FIRM            16418 NORTH 37TH STREET
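Note that min(Name) returns the alphabetically first spelling in each soundex group, which is not necessarily the most common one. A minimal sketch of keeping the most frequent spelling per group instead might look like the following. It is an illustration rather than a tested solution: #NameCounts is a hypothetical temp table name, the sample rows are trimmed to Code and Name, and the syntax is chosen to avoid features newer than SQL Server 2000 but has not been run on that version.

```sql
-- Sketch only: keep the most frequent spelling of Name in each soundex group.
-- Written to avoid post-2000 syntax, but not tested on SQL Server 2000.
declare @T table (Code char(5), Name varchar(50))

-- INSERT ... SELECT ... UNION ALL instead of a multi-row VALUES list.
insert into @T (Code, Name)
select '10003', 'AMERICAN NUTRITON INC'  union all
select '10003', 'AMEICAN NUTRITION INC'  union all
select '10003', 'AMERICAN NUTRITION INC' union all
select '10003', 'AMERICAN NUTRITION INC' union all
select '10003', 'Samantha Brooks'        union all
select '10003', 'BEMIS COMPANY'          union all
select '10003', 'CEI'                    union all
select '10003', 'Pacific Pet'            union all
select '10003', 'PETSMART, INC.'         union all
select '10003', 'THE PET FIRM'

-- Count how often each exact spelling occurs within its soundex group.
-- #NameCounts is a hypothetical temp table name.
select Code, soundex(Name) as Snd, Name, count(*) as Cnt
into #NameCounts
from @T
group by Code, soundex(Name), Name

-- Keep the spelling(s) with the highest count per group.
-- Ties are not broken: every tied spelling is returned.
select c.Code, c.Name, c.Cnt
from #NameCounts c
where c.Cnt = (
    select max(c2.Cnt)
    from #NameCounts c2
    where c2.Code = c.Code
      and c2.Snd = c.Snd
)
order by c.Name

drop table #NameCounts
```

On these sample rows, the two AMERICAN spellings that share a soundex code should resolve to the one that occurs twice. 'AMEICAN NUTRITION INC' still lands in its own group, as in the output above, because the dropped R changes its soundex code. To weight by usage rather than row count, sum(TimesUsed) could take the place of count(*), assuming that column is carried into the table.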
For work, I help write software that does address verification (for SmartyStreets). I'd like to echo Thomas' answer in that the only practical and effective solution would be to use a CASS-Certified vendor. It is highly complicated, but those services will do it for you and do it well.
I'll also add that most free APIs have license restrictions that prevent the use of their service for processing lists of addresses (Google isn't the only one -- even the USPS has restrictions for use of their API).
I would recommend a service like LiveAddress or CASS-Certified Scrubbing for your needs (the latter probably best for an existing table), but I'll let you do your own research so you're more informed. I'll be happy to help you personally with any more address-related questions.