Removing duplicates in lists

Pretty much, I need to write a program that checks if a list has any duplicates, and if it does, removes them and returns a new list with the items that weren't duplicated/removed. This is what I have, but to be honest I do not know what to do.

def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t

The common approach to getting a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

If order is important to you, then you will have to use a different mechanism. A very common solution is to rely on OrderedDict, which remembers the order in which keys were inserted:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this has the overhead of creating a dictionary first, and then creating a list from it. So if you don't actually need to preserve the order, you're better off using a set. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
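One such alternative, shown here as a minimal sketch, is to track the items you have already seen in a set while building the result list (the function name dedupe_keep_order is my own, for illustration):

def dedupe_keep_order(items):
    # Keep the first occurrence of each item, using a set
    # for O(1) membership tests.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

>>> dedupe_keep_order([1, 2, 3, 1, 2, 5, 6, 7, 8])
[1, 2, 3, 5, 6, 7, 8]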


Finally, note that both the set and the OrderedDict solutions (and the seen-set sketch above) require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to fall back to a slow approach in which you basically compare every item with every other item in a nested loop, as sketched below.
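A minimal sketch of that quadratic fallback, assuming plain == comparisons are meaningful for your items (the function name is my own, for illustration):

def dedupe_unhashable(items):
    # Quadratic fallback for unhashable items: the 'in' test on a
    # list compares the candidate against every element kept so far.
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

>>> dedupe_unhashable([[1, 2], [3], [1, 2]])
[[1, 2], [3]]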


In Python 2.7, the new way of removing duplicates from an iterable while preserving the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, OrderedDict gained a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
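If you want to reproduce such timings on your own machine, the standard timeit module is one way to do it; actual numbers will vary by machine and Python version, so none are shown here:

from timeit import timeit

# Total seconds for the default 1,000,000 runs of the statement.
print(timeit("list(OrderedDict.fromkeys('abracadabra'))",
             setup="from collections import OrderedDict"))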

In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

It's a one-liner: list(set(source_list)) will do the trick.

A set is something that can't possibly have duplicates.

Update: an order-preserving approach is two lines:

from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()

Here we use the fact that OrderedDict remembers the insertion order of keys and does not change it when the value at a particular key is updated. We insert True as values, but we could insert anything; the values are just not used. (A set works a lot like a dict with ignored values, too.)
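For example (note that in Python 3, .keys() returns a view object, so wrap the expression in list() if you need a real list):

>>> from collections import OrderedDict
>>> source_list = [1, 2, 3, 1, 2, 5]
>>> list(OrderedDict((x, True) for x in source_list).keys())
[1, 2, 3, 5]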
