SQL query to remove duplicates from large tables using join

2018-06-08 03:34:24

I am new in using T-SQL queries and I was trying different solutions in order to remove duplicate rows from a fairy large table (with over 270,000 rows). The table looks something like:

TableA
-----------
RowID int not null identity(1,1) primary key,
Col1 varchar(50) not null,
Col2 int not null,
Col3 varchar(50) not null

The rows for this table are not perfect duplicates because of the existence of the RowID identity field. The second table that I need to join with:

TableB
-----------
RowID int not null identity(1,1) primary key,
Col1 int not null,
Col2 varchar(50) not null

In TableA I have something like:

1 | gray | 4     | Angela 
2 | red  | 6     | Diana
3 | black| 6     | Alina
4 | black| 11    | Dana
5 | gray | 4     | Angela 
6 | red  | 12    | Dana
7 | red  | 6     | Diana
8 | black| 11    | Dana

And in TableB:

1 | 6  | klm
2 | 11 | lmi

Second column from TableB (Col1) is foreign key inside TableA (Col2). I need to remove ONLY the duplicates from TableA that has Col2 = 6 ignoring the other duplicates.

    1 | gray | 4     | Angela 
    2 | red  | 6     | Diana
    4 | black| 6     | Alina
    5 | black| 11    | Dana
    6 | gray | 4     | Angela 
    7 | red  | 12    | Dana
    8 | black| 11    | Dana

I tried using

DELETE FROM TableA a inner join TableB b on a.Col2=b.Col1
WHERE a.RowId NOT IN (SELECT MIN(RowId) FROM TableA GROUP BY RowId, Col1, Col2, Col3) and b.Col2="klm"

but I still get some of the duplicates that I need to remove.

What is the best way to remove not perfect duplicate rows using join?

well min would only be one and group by PK will give you everything
and the RowID are wrong in the example

DELETE FROM TableA a 
inner join TableB b 
       on a.Col2=b.Col1
WHERE a.RowId NOT IN (SELECT MIN(RowId) 
                      FROM TableA GROUP BY RowId, Col1, Col2,   Col3) 
and b.Col2="klm"

this would be rows to delete

select * 
from 
(  select *
        , row_number over (partition by Col1, Col3 order by RowID) as rn
   from TableA a
   where del.Col2 = 6
)  tt 
where tt.rn > 1

another solution is:

WITH CTE AS(
               SELECT  t.[col1], t.[col2], t.[col3], t.[col4],
                   RN = ROW_NUMBER() OVER (PARTITION BY t.[col1], t.[col2], t.[col3], t.[col4] ORDER BY t.[col1])
               FROM  [TableA] t
            )
            delete from  CTE WHERE RN > 1

regards.

链接地址: http://www.djcxy.com/p/24736.html

上一篇: 如何访问通过调用TSQL中的存储过程生成的当前作用域中的数据集？

下一篇: SQL查询使用连接从大型表中删除重复项