Why is deallocating heap memory much slower than allocating it?

This is an empirical assumption (that allocating is faster than de-allocating).

This is also one of the reasons, I guess, why heap-based storage (like the STL containers) chooses not to return currently unused memory to the system (which is why the shrink-to-fit idiom was born).
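
For concreteness, the shrink-to-fit idiom I mean looks roughly like this (a minimal sketch using std::vector; shrink_to_fit() is C++11 and only a non-binding request, while the swap trick is the older form):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(1000000);     // capacity grows to hold a million ints
    v.resize(10);                    // size shrinks; capacity usually stays large
    std::cout << "capacity after resize: " << v.capacity() << "\n";

    std::vector<int>(v).swap(v);     // pre-C++11 "swap trick": the temporary copy
                                     // has a tight capacity, and swap hands it back
    v.shrink_to_fit();               // or, since C++11, this non-binding request
    std::cout << "capacity after shrink: " << v.capacity() << "\n";
}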

And of course we shouldn't confuse 'heap' memory with 'heap'-like data structures.


So why is de-allocation slower?

Is it Windows-specific (I see it on Windows 8.1) or OS-independent?

Is there some C++-specific memory manager automatically involved when using 'new' / 'delete', or does memory management rely completely on the OS? (I know C++11 introduced some garbage-collection support, which I've never really used, preferring the good old stack, static duration, self-managed containers and RAII.)

Also, in the code of the FOLLY string I saw it using the old C heap allocation / deallocation. Is it faster than C++ 'new' / 'delete'?


PS Please note that the question is not about virtual memory mechanics; I understand that user-space programs don't use real memory addressing.


I had much the same idea as @Basile: I wondered whether your base assumption was actually (even close to) correct. Since you tagged the question C++, I wrote a quick benchmark in C++ instead.

#include <vector>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iomanip>
#include <locale>

int main() {
    std::cout.imbue(std::locale(""));

    using namespace std::chrono;
    using factor = microseconds;

    auto const size = 2000;

    std::vector<int *> allocs(size);

    auto start = high_resolution_clock::now();

    // allocate `size` separate arrays of `size` ints each
    for (int i = 0; i < size; i++)
        allocs[i] = new int[size];

    auto stop = high_resolution_clock::now();
    auto alloc_time = duration_cast<factor>(stop - start).count();

    start = high_resolution_clock::now();

    // free them in the same order they were allocated
    for (int i = 0; i < size; i++)
        delete[] allocs[i];

    stop = high_resolution_clock::now();

    auto del_time = duration_cast<factor>(stop - start).count();

    std::cout << std::left << std::setw(20) << "alloc time: " << alloc_time << " uS\n";
    std::cout << std::left << std::setw(20) << "del time: " << del_time << " uS\n";
}

I also used VC++ on Windows instead of gcc on Linux. The result wasn't much different though: freeing the memory took substantially less time than allocating it did. Here are the results from three successive runs.

alloc time:         2,381 uS
del time:           1,429 uS

alloc time:         2,764 uS
del time:           1,592 uS

alloc time:         2,492 uS
del time:           1,442 uS

I'd warn, however, that allocation and freeing are handled (primarily) by the standard library, so results could differ between one standard library and another (even when using the same compiler). I'd also note that it wouldn't surprise me if this changed somewhat in multi-threaded code. Although it's not actually correct, there appear to be a few authors who are under the misapprehension that freeing in a multithreaded environment requires locking the heap for exclusive access. This can be avoided, but the means of doing so isn't necessarily immediately obvious.
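
That avoidance can be done, for example, with per-thread "pending free" lists in the style of thread-caching allocators. The sketch below is my own illustration (ThreadArena, free_remote and drain_pending are made-up names, not any real allocator's API): a thread freeing a block it doesn't own pushes it onto the owning thread's lock-free pending stack, and the owner reclaims those blocks on a later allocation, so no global heap lock is ever taken.

#include <atomic>

// Hypothetical block header with an intrusive link, so a freed block can sit
// on a lock-free "pending free" stack until its owner reclaims it.
struct Block {
    std::atomic<Block*> next_pending{nullptr};
    // ...payload would follow the header in a real allocator...
};

// Each thread owns one arena. Remote threads never touch the owner's free
// list; they only push onto pending_, which the owner drains lazily.
class ThreadArena {
    std::atomic<Block*> pending_{nullptr};   // blocks freed by *other* threads
public:
    // Called by a thread freeing a block it does not own: lock-free push.
    void free_remote(Block* b) {
        Block* head = pending_.load(std::memory_order_relaxed);
        do {
            b->next_pending.store(head, std::memory_order_relaxed);
        } while (!pending_.compare_exchange_weak(head, b,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed));
    }

    // Called by the owner (e.g. on its next allocation): take the whole
    // pending list with one atomic exchange, then recycle it without locks.
    void drain_pending() {
        Block* b = pending_.exchange(nullptr, std::memory_order_acquire);
        while (b) {
            Block* next = b->next_pending.load(std::memory_order_relaxed);
            recycle(b);
            b = next;
        }
    }

private:
    void recycle(Block* /*b*/) {
        // A real allocator would thread the block back onto this arena's
        // single-threaded local free list here.
    }
};

int main() {
    ThreadArena arena;
    Block b;                 // pretend this block belongs to arena's owner
    arena.free_remote(&b);   // another thread "frees" it without a heap lock
    arena.drain_pending();   // the owner reclaims it later
}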


The assertion that allocating memory is faster than deallocating it seemed a bit odd to me, so I tested it. I ran a test where I allocated 64 MB of memory in 32-byte chunks (so 2M calls to new), then tried deleting that memory both in the same order it was allocated and in a random order. I found that linear-order deallocation was about 3% faster than allocation, and that random-order deallocation was about 10% slower than linear allocation.

I then ran a test where I started with 64MB of allocated memory, and then 2M times either allocated new memory or deleted existing memory (at random). Here, I found that deallocation was about 4.3% slower than allocation.

So, it turns out you were correct - deallocation is slower than allocation (though I wouldn't call it "much" slower). I suspect this has simply to do with more random accesses, but I have no evidence for this other than that the linear deallocation was faster.

To answer some of your questions:

Is there some C++-specific memory manager automatically involved when using 'new' / 'delete'?

Yes. The OS has system calls which allocate pages of memory (typically 4 KB chunks) to processes. It's the process's job to divide those pages up into objects. Try looking up the "GNU Memory Allocator."
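
To make that concrete, here is a minimal Linux-flavoured sketch of the idea (mmap with MAP_ANONYMOUS is the POSIX/Linux way to get anonymous pages; on Windows the analogue would be VirtualAlloc). A real allocator such as the GNU allocator requests much larger regions and keeps per-chunk bookkeeping; this only shows "get a page from the OS, carve it into objects yourself":

#include <sys/mman.h>   // mmap / munmap (POSIX)
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t page = 4096;           // typical page size (assumption)
    void* raw = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) { std::perror("mmap"); return 1; }

    // "Allocate" by handing out successive 32-byte slices of the page;
    // the OS never sees these sub-allocations.
    const std::size_t obj_size = 32;
    char* base = static_cast<char*>(raw);
    void* first  = base;                     // object 0
    void* second = base + obj_size;          // object 1
    std::printf("page at %p, objects at %p and %p\n", raw, first, second);

    munmap(raw, page);                       // return the whole page to the OS
}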

I saw it using the old C heap allocation / deallocation. Is it faster than C++ 'new' / 'delete'?

Most C++ new / delete implementations just call malloc and free under the hood. This is not required by the standard, however, so it's a good idea to always use the same allocation and deallocation function on any particular object.
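
For illustration, the standard does allow a program to replace the global allocation functions, and a typical default implementation behaves roughly like the sketch below (forwarding to malloc / free). This is an assumption about a "typical" library, not a quote from any particular one:

#include <cstdlib>
#include <new>

// Replaceable global allocation functions, forwarding to the C heap.
void* operator new(std::size_t size) {
    if (void* p = std::malloc(size ? size : 1))   // malloc(0) may return null
        return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    std::free(p);
}

int main() {
    int* p = new int(42);   // goes through the operator new above
    delete p;               // and through the matching operator delete
}

This pairing is also why mixing the two families (malloc with delete, or new with free) is undefined behaviour: they are only guaranteed to match when used as above.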

I ran my tests with the native testing framework provided in Visual Studio 2015, on a Windows 10 64-bit machine (The tests were also 64-bit). Here's the code:

#include "stdafx.h"
#include "CppUnitTest.h"

using namespace Microsoft::VisualStudio::CppUnitTestFramework;

namespace AllocationSpeedTest
{       
    class Obj32 {
        uint64_t a;
        uint64_t b;
        uint64_t c;
        uint64_t d;
    };
    constexpr int len = 1024 * 1024 * 2;
    Obj32* ptrs[len];
    TEST_CLASS(UnitTest1)
    {
    public:
        // Baseline: 2M in-order allocations of 32-byte objects.
        TEST_METHOD(Linear32Alloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
        }
        TEST_METHOD(Linear32AllocDealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            for (int i = 0; i < len; ++i) {
                delete ptrs[i];
            }
        }
        // Allocate, then shuffle the pointer array without deallocating: a
        // baseline so the shuffle cost can be subtracted from the next test.
        TEST_METHOD(Random32AllocShuffle)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                int pos = (rand() % (len - i)) + i;
                Obj32* temp = ptrs[i];
                ptrs[i] = ptrs[pos];
                ptrs[pos] = temp;
            }
        }
        TEST_METHOD(Random32AllocShuffleDealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                int pos = (rand() % (len - i)) + i;
                Obj32* temp = ptrs[i];
                ptrs[i] = ptrs[pos];
                ptrs[pos] = temp;
            }
            for (int i = 0; i < len; ++i) {
                delete ptrs[i];
            }
        }
        // The four Mixed tests share the same random sequence; comparing them
        // isolates the cost of the interleaved allocations vs. deallocations.
        TEST_METHOD(Mixed32Both)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    ptrs[i] = new Obj32();
                }
                else {
                    delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Alloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    ptrs[i] = new Obj32();
                }
                else {
                    //delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Dealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    //ptrs[i] = new Obj32();
                }
                else {
                    delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Neither)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    //ptrs[i] = new Obj32();
                }
                else {
                    //delete ptrs[i];
                }
            }
        }
    };
}

And here are the raw results over several runs. All numbers are in milliseconds.

[Raw results table]


I am not sure about your observation. I wrote the following program (on Linux; hopefully you can port it to your system).

// public domain code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <assert.h>


const unsigned possible_word_sizes[] = {
  1, 2, 3, 4, 5,
  8, 12, 16, 24,
  32, 48, 64, 128,
  256, 384, 2048
};

long long totalsize;

// return a calloc-ed array of nbchunks malloc-ed zones of
// somehow random size
void **
malloc_chunks (int nbchunks)
{
  const int nbsizes =
    (int) (sizeof (possible_word_sizes)
           / sizeof (possible_word_sizes[0]));
  void **ad = calloc (nbchunks, sizeof (void *));
  if (!ad)
    {
      perror ("calloc chunks");
      exit (EXIT_FAILURE);
    }
  for (int ix = 0; ix < nbchunks; ix++)
    {
      unsigned sizindex = random () % nbsizes;
      unsigned size = possible_word_sizes[sizindex];
      void *zon = malloc (size * sizeof (void *));
      if (!zon)
        {
          fprintf (stderr,
                   "malloc#%d (%u words) failed (total %lld) %s\n",
                   ix, size, totalsize, strerror (errno));
          exit (EXIT_FAILURE);
        }
      ((int *) zon)[0] = ix;
      totalsize += size;
      ad[ix] = zon;
    }
  return ad;
}

void
free_chunks (void **chks, int nbchunks)
{
  // first, free two thirds of the chunks in random order
  for (int i = 0; 3 * i < 2 * nbchunks; i++)
    {
      int pix = random () % nbchunks;
      if (chks[pix])
        {
          free (chks[pix]);
          chks[pix] = NULL;
        }
    }
  // then, free the rest in reverse order
  for (int i = nbchunks - 1; i >= 0; i--)
    if (chks[i])
      {
        free (chks[i]);
        chks[i] = NULL;
      }
}

int
main (int argc, char **argv)
{
  assert (sizeof (int) <= sizeof (void *));
  int nbchunks = (argc > 1) ? atoi (argv[1]) : 32768;
  if (nbchunks < 128)
    nbchunks = 128;
  srandom (time (NULL));
  printf ("nbchunks=%d\n", nbchunks);
  void **chks = malloc_chunks (nbchunks);
  clock_t clomall = clock ();
  printf ("clomall=%ld totalsize=%lld words\n",
          (long) clomall, totalsize);
  free_chunks (chks, nbchunks);
  clock_t clofree = clock ();
  printf ("clofree=%ld\n", (long) clofree);
  return 0;
}

I compiled it with gcc -O2 -Wall mf.c -o mf on my Debian/Sid/x86-64 machine (i3770k, 16 GB). I ran time ./mf 100000 and got:

nbchunks=100000
clomall=54162 totalsize=19115681 words
clofree=83895
./mf 100000  0.02s user 0.06s system 95% cpu 0.089 total

On my system, clock gives CPU microseconds, so the malloc phase took about 54,162 µs while the free phase took about 83,895 - 54,162 ≈ 29,733 µs. If the calls to random are negligible (and I don't know whether they are) with respect to the malloc and free time, I tend to disagree with your observation: free seems to be about twice as fast as malloc. My gcc is 6.1, my libc is glibc 2.22.

Please take time to compile the above benchmark on your system and report the timings.

FWIW, I took Jerry's code and

 g++ -O3 -march=native jerry.cc -o jerry
 time ./jerry;  time ./jerry; time ./jerry

gives

alloc time:         1940516
del time:           602203
./jerry  0.00s user 0.01s system 68% cpu 0.016 total
alloc time:         1893057
del time:           558399
./jerry  0.00s user 0.01s system 68% cpu 0.014 total
alloc time:         1818884
del time:           527618
./jerry  0.00s user 0.01s system 70% cpu 0.014 total
下一篇: 为什么释放堆内存比分配内存慢得多?