What could cause a mutex to misbehave?

2018-06-29 11:59:51

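For the past few months I have been debugging a rare crash that occurs somewhere inside a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to use a heap corruption checker to catch out-of-bounds (OOB) memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own functions, padding every allocated block with a known amount of known data so that OOB writes would be detected. It found nothing, even with 1 KB of padding before and after every allocated block. There are hundreds of thousands of allocated blocks because of heavy use of STL containers, so I cannot enlarge the padding further, and I assume any write more than 1 KB out of bounds would eventually trigger a segfault anyway. This bounds checker has found other problems in the past, so I do not doubt its functionality.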
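To make the padding idea concrete, here is a minimal sketch of a guard-padded allocator. It is not the author's checker and not the code from the linked answer; the names `guarded_malloc`, `guards_intact`, `guarded_free`, the 1 KB pad size, and the `0xA5` fill byte are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical names and sizes, for illustration only.
static const std::size_t kPad = 1024;     // guard bytes before and after the user block
static const unsigned char kFill = 0xA5;  // known pattern written into the guards

// Layout of one raw allocation: [size_t size][kPad guard][user bytes][kPad guard]
void* guarded_malloc(std::size_t size) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(sizeof(std::size_t) + 2 * kPad + size));
    if (raw == NULL) return NULL;
    std::memcpy(raw, &size, sizeof(size));        // remember the user size
    unsigned char* user = raw + sizeof(std::size_t) + kPad;
    std::memset(user - kPad, kFill, kPad);        // leading guard
    std::memset(user + size, kFill, kPad);        // trailing guard
    return user;
}

// Returns true if both guard regions still hold the fill pattern.
bool guards_intact(const void* ptr) {
    const unsigned char* user = static_cast<const unsigned char*>(ptr);
    std::size_t size;
    std::memcpy(&size, user - kPad - sizeof(std::size_t), sizeof(size));
    const unsigned char* lead = user - kPad;
    const unsigned char* trail = user + size;
    for (std::size_t i = 0; i < kPad; ++i) {
        if (lead[i] != kFill || trail[i] != kFill) return false;
    }
    return true;
}

// Checks the guards, then releases the raw block. Returns the check result.
bool guarded_free(void* ptr) {
    if (ptr == NULL) return true;
    bool ok = guards_intact(ptr);
    std::free(static_cast<unsigned char*>(ptr) - kPad - sizeof(std::size_t));
    return ok;
}
```

A wrapper like this only detects writes that land inside a guard region, and only when the guards are next checked. Writes that skip past the padding, or that corrupt memory through a stale pointer to a freed block, go unnoticed. The returned pointer keeps `size_t` alignment but not necessarily the stricter alignment that plain `malloc` provides.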

(Before anyone says "Valgrind": yes, I have tried that too, with no results either.)

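Now, my memory bounds checker also has a feature where it prepends a data struct to every allocated block. These structs are all linked into one long linked list, so that I can occasionally walk all allocations and test memory integrity. For some reason, even though every manipulation of this list is mutex protected, the list was getting corrupted. When I investigated, it began to look as if the mutex itself was occasionally failing to do its job. Here is the pseudocode: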

pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
  // ...
  pthread_mutex_lock(&alloc_mutex);
  if (boolmutex) {
    printf("mutex misbehaving\n");
    __THROW_ERROR__; // this happens!
  }
  boolmutex = true;
  // manipulate linked list here
  boolmutex = false;
  pthread_mutex_unlock(&alloc_mutex);
  // ...
}

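The line commented with "this happens!" is occasionally reached, even though that seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex inside a struct, with large arrays before and after it, but when the problem occurred the arrays were untouched, so nothing appears to be overwriting it.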
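The guarded-mutex experiment described above could look roughly like the following sketch. The struct name `GuardedMutex`, the helper functions, the 4 KB array size, and the `0x5A` fill byte are assumptions; the post does not say how large the arrays were or how they were checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <pthread.h>

// Hypothetical sizes and names, for illustration only.
static const std::size_t kGuardBytes = 4096;
static const unsigned char kGuardFill = 0x5A;

// The mutex sits between two large arrays filled with a known pattern.
struct GuardedMutex {
    unsigned char before[kGuardBytes];
    pthread_mutex_t mutex;
    unsigned char after[kGuardBytes];
};

// Fills both guard arrays and initializes the mutex with default attributes.
// Returns the result of pthread_mutex_init (0 on success).
int guarded_mutex_init(GuardedMutex* g) {
    std::memset(g->before, kGuardFill, sizeof(g->before));
    std::memset(g->after, kGuardFill, sizeof(g->after));
    return pthread_mutex_init(&g->mutex, NULL);
}

// Returns true if neither guard array has been modified since init.
bool guarded_mutex_intact(const GuardedMutex* g) {
    for (std::size_t i = 0; i < kGuardBytes; ++i) {
        if (g->before[i] != kGuardFill || g->after[i] != kGuardFill) return false;
    }
    return true;
}
```

Calling `guarded_mutex_intact` at the point where the "impossible" branch is reached shows whether a linear overrun from a neighboring object got near the mutex. Intact guards do not rule out a stray write that lands directly on the mutex itself.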

So, what kind of corruption could possibly cause this to happen, and how would I find and fix the cause?

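A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruption less common, but it does not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after anywhere from 5 minutes to several hours). When the problem occurs, it is quite late in the test (say, 15 seconds in), so this is not a bad initialization issue. The memory bounds checker never catches an actual out-of-bounds write, yet glibc still occasionally fails with a corrupted heap error (can such an error be caused by something other than an OOB write?). Every failure generates a core dump with plenty of trace information; I see no pattern in these dumps, and no particular section of code shows up more often than others. The problem seems very specific to one particular family of algorithms and does not occur in other algorithms, so I am quite certain this is not a sporadic hardware or memory error. I have done many more tests to check for OOB heap accesses, which I will not list here to keep this post from getting any longer.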

Thanks in advance for any help!

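Thanks to all commenters. I had tried nearly all of the suggestions with no results when I finally decided to write a simple memory allocation stress test: one that runs a thread on each CPU core (my unit is a Freescale i.MX6 quad-core SoC), with each thread allocating and freeing memory in random order at high speed. The test crashes with a glibc memory corruption error within minutes, or a few hours at most.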

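Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so it is specific either to ARM in general or perhaps to some patch shipped with the particular BSP version that included kernel 3.0.35.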

For those curious, the stress test source code is attached. Set NUM_THREADS to the number of CPU cores and build with:

<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap

I hope this information helps someone. Cheers :)

// Multithreaded heap stress test. By Itay Chamiel 20151012.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>

#define NUM_THREADS 4 // set to number of CPU cores

#define ALIVE_INDICATOR NUM_THREADS

// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;

    int num_allocs = 0, num_frees = 0;

    while(1)
    {
        int action;
        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;
            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);
            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);
            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert (pos < membufs_size);
            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            membufs_size--;
            assert(membufs_size >= 0);
            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while(1) {
        sleep(10);
        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);
        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}
