了解std :: hardware

2018-06-28 08:26:42

C ++ 17增加了std::hardware_destructive_interference_size和std::hardware_constructive_interference_size 。首先，我认为这只是获得L1缓存线大小的便携式方法，但这太简单了。

问题：

这些常量如何与L1缓存行大小相关？

有没有一个很好的例子来演示他们的用例？

两者都被定义为static constexpr 。如果您构建二进制文件并在具有不同缓存行大小的其他机器上执行它，那么这不是问题吗？如果您不确定代码将运行在哪台机器上，它如何防止在这种情况下发生虚假共享？

这些常量的意图确实是获得缓存行大小。阅读其理由的最佳地点在于提案本身：

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

我将在这里引用一些基本原理来简化阅读：

[...]不干扰（对一阶）的内存粒度[通常称为高速缓存行大小。

缓存行大小的使用分为两大类：

避免来自不同线程的时间不相交的运行时访问模式的对象之间的破坏性干扰（假分享）。

促进具有暂时本地运行时访问模式的对象之间的建设性干扰（真实共享）。

这个有用的实现数量最显着的问题是尽管它们作为一个群体普遍存在和受欢迎，但在当前实践中用于确定其价值的方法的可移植性是可疑的。 [...]

我们的目标是为这个事业贡献一个适度的发明，这个量的抽象可以通过实现为给定目的保守地定义：

破坏性干扰大小：适合作为两个对象之间的偏移量的数字，以避免由于不同线程的不同运行时访问模式而导致虚假共享。

建设性干扰大小：一个适合作为两个对象的组合内存占用空间大小和基本对齐限制的数字，可能促进它们之间的真实共享。

在这两种情况下，这些值均以实施质量为基础提供，纯粹作为可能提高绩效的提示。这些是使用alignas()关键字的理想便携式值，目前几乎没有标准支持的便携式用途。

“这些常量如何与L1缓存行大小相关？”

理论上，相当直接。

假设编译器确切知道你将要运行的体系结构 - 那么这些几乎肯定会给你L1高速缓存线的大小。（如后文所述，这是一个很大的假设。）

对于它的价值，我几乎总是期望这些值是相同的。我相信他们分开申报的唯一原因是为了完整。（也就是说，编译器可能会估计L2高速缓存行大小，而不是L1高速缓存行大小，但对于建设性干扰，我不知道这实际上是否有用）。

“是否有一个很好的例子来证明它们的用例？”

在这个答案的底部，我附上了一个长期的基准程序，演示了虚假分享和真实分享。

它通过分配一个int包装数组展示虚假共享：在一种情况下，多个元素适合L1高速缓存行，另一个元素占用L1高速缓存行。在紧密的循环中，从阵列中选择一个固定元素并重复更新。

它通过在包装器中分配单个int对来演示真正的共享：在一种情况下，对中的两个int并不适合L1高速缓存行大小，而在另一种情况下，它们可以。在紧密的循环中，该对中的每个元素都会重复更新。

请注意，访问被测对象的代码不会改变; 唯一的区别是对象本身的布局和对齐。

我没有C ++ 17编译器（并且假设大多数人目前都不这样做），所以我用自己的代码替换了有问题的常量。您需要更新这些值才能在您的机器上保持准确。也就是说，64字节可能是典型的现代桌面硬件上的正确值（写作时）。

警告：测试将使用您的机器上的所有内核，并分配〜256MB的内存。不要忘记编译优化！

在我的机器上，输出是：

Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457

我通过避免虚假分享而获得〜3.5倍的加速，并通过确保真实分享来加速〜1.7倍。

“两者都被定义为静态constexpr，如果你构建一个二进制文件并在其他具有不同缓存行大小的机器上执行它，那么这不是一个问题吗？当你不确定代码将在哪台机器上运行时，如何防止在这种情况下发生虚假共享在跑？“

这确实是一个问题。这些常量并不能保证映射到目标机器上的任何缓存行大小，但是这些常量旨在成为编译器能够达到的最佳近似值。

这在提案中已经注明，在附录中，他们举例说明了一些库在编译时基于各种环境提示和宏如何尝试检测缓存行大小。你保证这个值至少是alignof(max_align_t) ，这是一个明显的下限。

换句话说，这个值应该被用作你的后备案例; 如果你知道它，你可以自由定义一个精确的值，例如：

constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
  return KNOWN_L1_CACHE_LINE_SIZE;
#else
  return std::hardware_destructive_interference_size;
#endif
}

在编译期间，如果您想要假定缓存行大小只需定义KNOWN_L1_CACHE_LINE_SIZE 。

希望这可以帮助！

基准程序：

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;

constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;

typedef unsigned useless_result_t;
typedef double elapsed_secs_t;

//////// CODE TO BE SAMPLED:

// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");

// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");

// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");

// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");

// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& vec) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    auto& element = vec[vec.size() / 2 + thread_index];

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        element.value = dist(mt);
    }

    return static_cast<useless_result_t>(element.value);
}

// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& pair) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        pair.first = dist(mt);
        pair.second = dist(mt);
    }

    return static_cast<useless_result_t>(pair.first) +
        static_cast<useless_result_t>(pair.second);
}

//////// UTILITIES:

// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}

    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    threadlatch latch{ num_threads + 1 };

    std::vector<std::future<useless_result_t>> futures;
    std::vector<std::thread> threads;
    for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
        std::packaged_task<useless_result_t()> task{
            std::bind(func, std::ref(latch), thread_index)
        };

        futures.push_back(task.get_future());
        threads.push_back(std::thread(std::move(task)));
    }

    const auto starttime = std::chrono::high_resolution_clock::now();

    latch.count_down_and_wait();
    for (auto& thread : threads) {
        thread.join();
    }

    const auto endtime = std::chrono::high_resolution_clock::now();
    const auto elapsed = std::chrono::duration_cast<
        std::chrono::duration<double>>(
            endtime - starttime
            ).count();

    useless_result_t result = 0;
    for (auto& future : futures) {
        result += future.get();
    }

    return std::make_tuple(result, elapsed);
}

// utility: sample the time it takes to run func on N threads
void run_tests(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    useless_result_t final_result = 0;
    double avgtime = 0.0;
    for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
        const auto result_and_elapsed = run_threads(func, num_threads);
        const auto result = std::get<useless_result_t>(result_and_elapsed);
        const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);

        final_result += result;
        avgtime = (avgtime * trial + elapsed) / (trial + 1);
    }

    std::cout
        << "Average time: " << avgtime
        << " seconds, useless result: " << final_result
        << std::endl;
}

int main() {
    const auto cores = std::thread::hardware_concurrency();
    std::cout << "Hardware concurrency: " << cores << std::endl;

    std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
    std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
    std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
    std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
    std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
    std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
    std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
    std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;

    {
        std::cout << "Running naive_int test." << std::endl;

        std::vector<naive_int> vec;
        vec.resize((1u << 28) / sizeof(naive_int));  // allocate 256 mibibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running cache_int test." << std::endl;

        std::vector<cache_int> vec;
        vec.resize((1u << 28) / sizeof(cache_int));  // allocate 256 mibibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running bad_pair test." << std::endl;

        bad_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
    {
        std::cout << "Running good_pair test." << std::endl;

        good_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
}

链接地址: http://www.djcxy.com/p/79205.html

上一篇: Understanding std::hardware

下一篇: What is the difference between char array vs char pointer in C?