C++ replacement for C99 VLAs (goal: preserve performance)

I am porting some C99 code that makes heavy use of variable length arrays (VLA) to C++.

I replaced the VLAs (stack allocation) with an array class that allocates memory on the heap. The performance hit was huge, a slowdown of a factor of 3.2 (see benchmarks below). What fast VLA replacement can I use in C++? My goal is to minimize performance hit when rewriting the code for C++.

One idea that was suggested to me was to write an array class with fixed-size storage inside the class (i.e. it can be stack-allocated), which is used for small arrays, with an automatic switch to heap allocation for larger arrays. My implementation of this is at the end of the post. It works fairly well, but I still cannot reach the performance of the original C99 code. To come close to it, I must increase this fixed-size storage (MSL below) to sizes I am not comfortable with. I don't want to reserve huge stack buffers even for the many small arrays that don't need them, because I worry that it will trigger a stack overflow. A C99 VLA is actually less prone to this because it never uses more storage than needed.

I came upon std::dynarray, but my understanding is that it was not accepted into the standard (yet?).

I know that clang and gcc support VLAs in C++, but I need it to work with MSVC too. In fact better portability is one of the main goals of rewriting as C++ (the other goal being making the program, which was originally a command line tool, into a reusable library).


Benchmark

MSL refers to the array size above which I switch to heap-allocation. I use different values for 1D and 2D arrays.

Original C99 code: 115 seconds.
MSL = 0 (ie heap allocation): 367 seconds (3.2x).
1D-MSL = 50, 2D-MSL = 1000: 187 seconds (1.63x).
1D-MSL = 200, 2D-MSL = 4000: 143 seconds (1.24x).
1D-MSL = 1000, 2D-MSL = 20000: 131 seconds (1.14x).

Increasing MSL further improves performance, but eventually the program starts returning wrong results (I assume due to stack overflow).

These benchmarks are with clang 3.7 on OS X, but gcc 5 shows very similar results.


Code

This is the current "smallvector" implementation I use. I need 1D and 2D vectors. I switch to heap allocation above size MSL.

#include <cstddef>   // size_t
using std::size_t;

// 1D array: in-object storage for up to MSL elements, heap allocation beyond that.
template<typename T, size_t MSL=50>
class lad_vector {
    const size_t len;
    T sdata[MSL];    // small-size storage, lives wherever the object lives (usually the stack)
    T *data;         // points into sdata, or to a heap block for large sizes
public:
    explicit lad_vector(size_t len_) : len(len_) {
        if (len <= MSL)
            data = &sdata[0];
        else
            data = new T[len];
    }

    // Copying would alias or double-delete the storage, so forbid it.
    lad_vector(const lad_vector &) = delete;
    lad_vector &operator=(const lad_vector &) = delete;

    ~lad_vector() {
        if (len > MSL)
            delete [] data;
    }

    const T &operator [] (size_t i) const { return data[i]; }
    T &operator [] (size_t i) { return data[i]; }

    operator T * () { return data; }
};


// 2D array: same idea, with the threshold applied to rows*cols.
template<typename T, size_t MSL=1000>
class lad_matrix {
    const size_t rows, cols;
    T sdata[MSL];
    T *data;

public:
    explicit lad_matrix(size_t rows_, size_t cols_) : rows(rows_), cols(cols_) {
        if (rows*cols <= MSL)
            data = &sdata[0];
        else
            data = new T[rows*cols];
    }

    // Non-copyable, for the same reason as lad_vector.
    lad_matrix(const lad_matrix &) = delete;
    lad_matrix &operator=(const lad_matrix &) = delete;

    ~lad_matrix() {
        if (rows*cols > MSL)
            delete [] data;
    }

    // operator[] returns a pointer to the start of row i; a second [] indexes the column.
    T const * operator[] (size_t i) const { return &data[cols*i]; }
    T * operator[] (size_t i) { return &data[cols*i]; }
};
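
For reference, a call site in the ported code looks roughly like this (the function and the computation below are invented purely for illustration):

// Hypothetical call site: the array sizes are only known at run time, which is
// what the original C99 code expressed with VLAs.
void compute_row_sums(size_t n, size_t m, double *out) {
    lad_vector<double> scale(n);     // in-object storage while n <= 50
    lad_matrix<double> work(n, m);   // in-object storage while n*m <= 1000

    for (size_t i = 0; i < n; ++i) {
        scale[i] = 1.0 / (i + 1);
        out[i] = 0.0;
        for (size_t j = 0; j < m; ++j) {
            work[i][j] = scale[i] * j;
            out[i] += work[i][j];
        }
    }
}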

Create a large buffer (MB+) in thread-local storage. (The actual memory is on the heap; only the management lives in TLS.)

Allow clients to request memory from it in FILO (stack-like) order. This mimics how C VLAs work, and it is efficient: each request/return is just an integer addition/subtraction.

Get your VLA storage from it.

Wrap it pretty, so you can say stack_array<T> x(1024); , and have that stack_array deal with construction/destruction (note that ->~T() where T is int is a legal no-op, and construction can similarly be a no-op), or make stack_array<T> wrap a std::vector<T, TLS_stack_allocator> . A sketch of this approach is given right below.
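
A minimal sketch of the idea, assuming C++11; all names are invented for illustration, and error handling, over-aligned types, and the multi-page refinement mentioned further down are glossed over:

#include <cstddef>
#include <cstdlib>
#include <new>

struct tls_stack {
    static const std::size_t capacity = 4 * 1024 * 1024;      // 4 MB per thread
    char *base = static_cast<char *>(std::malloc(capacity));  // actual memory on the heap
    std::size_t top = 0;                                       // bookkeeping in TLS

    static std::size_t round_up(std::size_t n) {               // keep every block max-aligned
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }
    void *allocate(std::size_t bytes) {
        bytes = round_up(bytes);
        if (top + bytes > capacity) throw std::bad_alloc();
        void *p = base + top;
        top += bytes;                                          // request: one addition
        return p;
    }
    void release(std::size_t bytes) {
        top -= round_up(bytes);                                // return: one subtraction (FILO!)
    }
};

inline tls_stack &per_thread_stack() {
    thread_local tls_stack s;   // one buffer per thread, so no synchronization is needed
    return s;
}

template <typename T>
class stack_array {
    std::size_t len;
    T *ptr;
public:
    explicit stack_array(std::size_t n)
        : len(n), ptr(static_cast<T *>(per_thread_stack().allocate(n * sizeof(T)))) {
        for (std::size_t i = 0; i < len; ++i) new (ptr + i) T;  // default-init: a no-op for int etc.
    }
    ~stack_array() {
        for (std::size_t i = len; i-- > 0;) ptr[i].~T();        // also a no-op for trivial T
        per_thread_stack().release(len * sizeof(T));
    }
    stack_array(const stack_array &) = delete;
    stack_array &operator=(const stack_array &) = delete;

    T &operator[](std::size_t i) { return ptr[i]; }
    const T &operator[](std::size_t i) const { return ptr[i]; }
    std::size_t size() const { return len; }
};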

The data will not be as local as C VLA data, because it effectively lives on a separate stack. You can add SBO (small buffer optimization) for the cases where locality really matters.

An SBO stack_array<T> can be implemented with an allocator and a std::vector unioned with a std::array, with a unique_ptr and a custom deleter, or in a myriad of other ways. You can probably retrofit your solution, replacing your new/malloc/free/delete with calls to the above TLS storage.
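
To illustrate one of those myriad ways, here is a rough sketch of the unique_ptr-with-custom-deleter variant (names invented; non-trivial element types and the TLS fallback are glossed over):

#include <array>
#include <cstddef>
#include <memory>

template <typename T, std::size_t N = 64>
class sbo_array {
    std::array<T, N> small;            // in-object storage used for small sizes
    struct deleter {
        bool heap;                     // remember where the memory came from
        void operator()(T *p) const { if (heap) delete[] p; }
    };
    std::unique_ptr<T[], deleter> data;
    std::size_t len;
public:
    explicit sbo_array(std::size_t n)
        : data(n <= N ? small.data() : new T[n], deleter{n > N}), len(n) {}

    T &operator[](std::size_t i) { return data[i]; }
    const T &operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return len; }
};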

I say go with TLS, as that removes the need for synchronization overhead while still allowing multi-threaded use, and mirrors the fact that the stack itself is implicitly thread-local.

Stack-buffer based STL allocator? is an SO Q&A with at least two "stack" allocators in the answers. They will need some adaptation to automatically get their buffer from TLS.

Note that the TLS being one large buffer is, in a sense, an implementation detail. You could do a large allocation and, when you run out of space, do another large allocation. You just need to keep track of each "stack page"'s current capacity and of a list of stack pages, so that when you empty one you can move back to an earlier one. That lets you be more conservative in your initial TLS allocation without worrying about running out of memory; the important part is that allocation is FILO and rare, not that the entire FILO buffer is one contiguous block.


I think you have already enumerated most options in your question and the comments.

  • Use std::vector. This is the most obvious and most hassle-free solution, but maybe also the slowest.
  • Use platform-specific extensions on those platforms that provide them. For example, GCC supports variable-length arrays in C++ as an extension. Most Unix-like platforms provide alloca, which allocates memory on the stack (it is widely supported even though it is not actually part of POSIX). Even Microsoft Windows provides _malloca , as a quick web search told me.

    In order to avoid maintenance nightmares, you'll really want to encapsulate these platform dependencies into an abstract interface that automatically and transparently chooses the appropriate mechanism for the current platform. Implementing this for all platforms will be a bit of work, but if this single feature accounts for the 3× speed difference you're reporting, it might be worth it. As a fallback for unknown platforms, I'd keep std::vector in reserve as a last resort. It is better to run slowly but correctly than to behave erratically or not run at all.

  • Build your own variable-sized array type that implements a “small array” optimization embedded as a buffer inside the object itself as you have shown in your question. I'll just note that I'd rather try using a union of a std::array and a std::vector instead of rolling my own container.

    Once you have a custom type in place, you can do interesting profiling such as maintaining a global hash table of all occurrences of this type (by source-code location) and recording each allocation size during a stress test of your program. You can then dump the hash table at program exit and plot the distributions in allocation sizes for the individual arrays. This might help you to fine-tune the amount of storage to reserve for each array individually on the stack.

  • Use a std::vector with a custom allocator. At program startup, allocate a few megabytes of memory and give it to a simple stack allocator. For a stack allocator, allocation is just comparing and adding two integers, and deallocation is simply a subtraction. I doubt that the compiler-generated stack allocation can be much faster. Your “array stack” would then pulsate in step with your “program stack”. (A sketch of such an allocator follows after this list.) This design would also have the advantage that accidental buffer overruns – while still invoking undefined behavior, trashing random data and all that bad stuff – wouldn't corrupt the program stack (return addresses) as easily as they would with native VLAs.

    Custom allocators in C++ are a somewhat dirty business, but some people do report using them successfully. (I don't have much experience with them myself.) You might want to start by looking at cppreference. Alisdair Meredith, one of the people who promote the use of custom allocators, gave a double-session talk at CppCon 2014 titled “Making Allocators Work” (part 1, part 2) that you might find interesting as well. If the std::allocator interface is too awkward for you, implementing your own variably (as opposed to dynamically) sized array class with your own allocator should be doable as well.
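
    To make the custom-allocator idea concrete, here is a minimal sketch of such a stack ("arena") allocator, assuming C++11; all names are invented and alignment handling is kept to the bare minimum:

    #include <cstddef>
    #include <new>
    #include <vector>

    class arena {
        char *buf;
        std::size_t cap;
        std::size_t top = 0;
        static std::size_t round_up(std::size_t n) {           // keep every block max-aligned
            const std::size_t a = alignof(std::max_align_t);
            return (n + a - 1) / a * a;
        }
    public:
        arena(char *b, std::size_t n) : buf(b), cap(n) {}
        void *allocate(std::size_t bytes) {
            bytes = round_up(bytes);
            if (top + bytes > cap) throw std::bad_alloc();
            void *p = buf + top;
            top += bytes;                                       // allocation: compare and add
            return p;
        }
        void deallocate(void *p, std::size_t bytes) {
            // Only the most recent block can really be released; anything else is
            // ignored, which is fine as long as lifetimes nest (FILO usage).
            bytes = round_up(bytes);
            if (static_cast<char *>(p) + bytes == buf + top)
                top -= bytes;                                   // deallocation: a subtraction
        }
    };

    template <typename T>
    struct arena_allocator {
        using value_type = T;
        arena *a;
        explicit arena_allocator(arena *a_) : a(a_) {}
        template <typename U> arena_allocator(const arena_allocator<U> &o) : a(o.a) {}
        T *allocate(std::size_t n) { return static_cast<T *>(a->allocate(n * sizeof(T))); }
        void deallocate(T *p, std::size_t n) { a->deallocate(p, n * sizeof(T)); }
    };
    template <typename T, typename U>
    bool operator==(const arena_allocator<T> &x, const arena_allocator<U> &y) { return x.a == y.a; }
    template <typename T, typename U>
    bool operator!=(const arena_allocator<T> &x, const arena_allocator<U> &y) { return !(x == y); }

    // Usage: one arena per program (or per thread); every "VLA" becomes a vector drawing from it.
    alignas(std::max_align_t) static char storage[8 * 1024 * 1024];
    static arena scratch(storage, sizeof storage);

    void work(std::size_t n) {
        std::vector<double, arena_allocator<double>> tmp(n, 0.0, arena_allocator<double>(&scratch));
        // ... use tmp like a VLA; its memory is handed back to the arena when tmp goes out of scope.
    }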


  • Regarding support for MSVC:

    MSVC has _alloca which allocates stack space. It also has _malloca , which allocates on the stack for small requests and otherwise falls back to heap allocation.

    You cannot take advantage of the VLA type system, so you would have to change your code to work with a pointer to the first element of such an array.

    You may end up needing to use a macro with different definitions depending on the platform, e.g. one that invokes _alloca or _malloca on MSVC, and on g++ or other compilers either calls alloca (if they support it) or declares a VLA and a pointer to it. A sketch of such a macro is at the end of this answer.


    Consider investigating ways to rewrite the code without needing to allocate an unknown amount of stack. One option is to allocate a fixed-size buffer that is the maximum you will ever need. (If that would cause a stack overflow, it means your code is buggy anyway.)
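
    A hedged sketch of what such a macro could look like (identifiers invented here; on MSVC _malloca must be paired with _freea, and the non-MSVC branch relies on the compilers' VLA extension):

    #include <cstddef>

    #if defined(_MSC_VER)
      #include <malloc.h>
      // _malloca uses the stack for small requests and falls back to the heap otherwise.
      #define SCRATCH_ARRAY(type, name, count) \
          type *name = static_cast<type *>(_malloca((count) * sizeof(type)))
      #define SCRATCH_FREE(name) _freea(name)
    #else
      // g++ and clang accept C99-style VLAs in C++ as an extension.
      #define SCRATCH_ARRAY(type, name, count) \
          type name##_vla[(count)]; type *name = name##_vla
      #define SCRATCH_FREE(name) ((void)0)
    #endif

    // Usage: the surrounding code only ever sees the pointer `name`.
    double sum_of_squares(std::size_t n) {
        SCRATCH_ARRAY(double, tmp, n);
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) { tmp[i] = static_cast<double>(i); s += tmp[i] * tmp[i]; }
        SCRATCH_FREE(tmp);
        return s;
    }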
