Merging vectors without extra memory

2018-06-28 08:23:35

I came across this code segment, which merges two vectors and favors the elements of one vector when there are duplicates:

std::vector<String> fields1 = fieldSource1.get();
std::vector<String> fields2 = fieldSource2.get();
// original
fields1.insert(std::end(fields1), std::begin(fields2), std::end(fields2));
std::stable_sort(std::begin(fields1), std::end(fields1));
fields1.erase(std::unique(std::begin(fields1), std::end(fields1)), std::end(fields1));
return fields1;

Given that the Strings are unique within their respective vectors, and that the order of Strings in the output vector is irrelevant, I think I can make this algorithm more efficient.

I would like to avoid the extra memory allocation of std::set_union() and std::set_difference().

(Inserting the output of std::set_difference directly into the original vector is not an option, because resizing invalidates its iterators.)

I ended up with this, which is std::set_difference with one iterator replaced by an index:

std::sort(std::begin(fields1), std::end(fields1));
std::sort(std::begin(fields2), std::end(fields2));
// Track positions by index, because appending to fields1 may reallocate and invalidate iterators
size_t index = 0;
const size_t end = std::size(fields1);
std::remove_copy_if(std::begin(fields2), std::end(fields2), std::back_inserter(fields1),
[&fields1, &index, end](const String& field)->bool{
    auto begin = std::begin(fields1);
    auto found = std::lower_bound(begin+index, begin+end, field);
    index = std::distance(begin, found);
    // found may be the end of the original range, so check before dereferencing
    return index != end && (*found) == field;
});
return fields1;

My question is: can I make this merge operation more efficient? If not, can I make it more readable?

Representing a collection of strings as a vector is inefficient if you want to keep them sorted or mergeable. It is better to use another container such as std::set, which keeps its elements sorted and unique with logarithmic insertion, or std::unordered_set, which offers average constant-time insertion and lookup.
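As a rough illustration of the std::set suggestion, here is a minimal sketch. The helper name merge_unique is hypothetical and does not come from the question. It relies on the fact that inserting into a std::set never replaces an element that is already present, so elements from the first vector win over equal elements from the second.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: merge two vectors into a sorted, duplicate-free set.
std::set<std::string> merge_unique(const std::vector<std::string>& first,
                                   const std::vector<std::string>& second)
{
    // Construct the set from the first vector; duplicates collapse automatically.
    std::set<std::string> merged(first.begin(), first.end());
    // insert() skips any key that is already in the set, so existing entries are kept.
    merged.insert(second.begin(), second.end());
    return merged;
}
```

If the data can live in a std::set from the start, the merge reduces to the single insert call and no separate sort or unique pass is needed.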

Be aware that any solution that sorts the strings in place will probably fragment memory further and increase memory pressure much more than creating the correct data structure in the first place.

If you must keep them as a vector of strings, consider building a hash table of the strings seen so far and inserting only those strings whose hash has not yet been seen. If there are a great many duplicates, this method may perform better than sorting each list independently. Note that tracking only the hash values means that two distinct strings whose hashes collide would be treated as duplicates.

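#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

typedef std::size_t hash_type;
typedef std::string value_type;
typedef std::vector< value_type > values_type;
typedef std::hash< value_type > value_hash_type;
typedef std::unordered_set< hash_type > hash_set_type;

// Records one_hash and returns true if it has not been seen before.
// Only the hash is stored, so distinct strings with colliding hashes count as duplicates.
bool is_new_hash(hash_set_type &hash_set,
    const hash_type one_hash
    )
{
    // insert() reports through .second whether the value was actually added
    return hash_set.insert(one_hash).second;
}

int main()
{
    values_type str1, str2, dest;
    str1.push_back("c");
    str1.push_back("a");
    str1.push_back("b");

    str2.push_back("c");
    str2.push_back("d");

    hash_set_type hash_set;
    value_hash_type value_hash;

    // Strings from str1 are visited first, so they win over duplicates in str2
    for (const auto &s : str1)
    {
        if (is_new_hash(hash_set, value_hash(s)))
            dest.push_back(s);
    }
    for (const auto &s : str2)
    {
        if (is_new_hash(hash_set, value_hash(s)))
            dest.push_back(s);
    }
    // Optional: sort only if the caller needs ordered output
    std::sort(dest.begin(), dest.end());
}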
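Because the approach above stores only hash values, a collision between two different strings would silently drop one of them. The following is a sketch of one possible variant that compares the strings themselves, assuming C++17 is available for std::string_view. The helper name merge_without_collisions is hypothetical. The set holds views into the input vectors instead of copies, so the inputs must stay alive and unmodified while the function runs.

```cpp
#include <initializer_list>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical helper: keep the first occurrence of every string, first vector first.
std::vector<std::string> merge_without_collisions(const std::vector<std::string>& first,
                                                  const std::vector<std::string>& second)
{
    // Views refer to the strings owned by the input vectors; no string data is copied here.
    std::unordered_set<std::string_view> seen;
    seen.reserve(first.size() + second.size());

    std::vector<std::string> dest;
    dest.reserve(first.size() + second.size());

    for (const std::vector<std::string>* source : {&first, &second}) {
        for (const std::string& s : *source) {
            // insert() compares full string contents, so hash collisions are harmless.
            if (seen.insert(std::string_view(s)).second)
                dest.push_back(s);
        }
    }
    return dest;
}
```

The output keeps the elements of the first vector in their original order, followed by the elements of the second vector that were not already present. A final std::sort can be added if sorted output is required.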
