Secure and efficient way to modify multiple files on POSIX systems?

I have been following the discussion of the "bug" in EXT4 that causes files to be zeroed after a crash if one uses the "create temp file, write temp file, rename temp to target file" pattern. POSIX says that unless fsync() is called, you cannot be sure the data has been flushed to the hard disk.

Obviously doing:

0) get the file contents (read it or make it somehow)
1) open original file and truncate it
2) write new contents
3) close file

is not good even with fsync(), as the computer can crash during 2) or during fsync() and you end up with a partially written file.

Usually it has been thought that this is pretty safe:

0) get the file contents (read it or make it somehow)
1) open temp file
2) write contents to temp file
3) close temp file
4) rename temp file to original file

Unfortunately it isn't. To make it safe on EXT4 you would need to do:

0) get the file contents (read it or make it somehow)
1) open temp file
2) write contents to temp file
3) fsync()
4) close temp file
5) rename temp file to original file

This would be safe: on a crash you should end up with either the new contents or the old contents, never zeroed or partial contents. But if the application uses lots of files, an fsync() after every write would be slow.

So my question is: how do I modify multiple files efficiently on a system where fsync() is required to be sure that changes have been saved to disk? And I really mean modifying many files, as in thousands of them. Modifying two files and doing an fsync() after each wouldn't be too bad, but fsync() does slow things down when modifying many files.

EDIT: moved the fsync() before closing the temp file to get the correct order, and added emphasis on writing many, many files.


The short answer is: solving this in the app layer is the wrong place. EXT4 must make sure that after I close the file, the data is written out in a timely manner. As it is now, EXT4 "optimizes" this writing so it can collect more write requests and burst them out in one go.

The problem is obvious: no matter what you do, you can't be sure that your data ends up on the disk. Calling fsync() manually only makes things worse: you basically get in the way of EXT4's optimization, slowing the whole system down.

OTOH, EXT4 has all the information necessary to make an educated guess when it is necessary to write data out to the disk. In this case, I rename the temp file to the name of an existing file. For EXT4, this means that it must either postpone the rename (so the data of the original file stays intact after a crash) or it must flush at once. Since it can't postpone the rename (the next process might want to see the new data), renaming implicitly means to flush and that flush must happen on the FS layer, not the app layer.

EXT4 might create a virtual copy of the filesystem which contains the changes while the disk is not modified (yet). But this doesn't affect the ultimate goal: an app can't know what optimizations the FS is going to make, and therefore the FS must make sure that it does its job.

This is a case where ruthless optimizations have gone too far and ruined the results. Golden rule: Optimization must never change the end result. If you can't maintain this, you must not optimize.

As long as Tso believes that it is more important to have a fast FS than one which behaves correctly, I suggest not upgrading to EXT4 and closing all bug reports about this as "works as designed by Tso".

[EDIT] Some more thoughts on this. You could use a database instead of the files. Let's ignore the waste of resources for a moment. Can anyone guarantee that the files which the database uses won't become corrupted by a crash? Probably. The database can write the data and call fsync() every minute or so. But then, you could do the same:

while true; do sync; sleep 60; done

Again, the bug in the FS prevents this from working in every case. Otherwise, people wouldn't be so bothered by this bug.

You could use a background config daemon, like the Windows registry. The daemon would write all configs into one big file and could call fsync() after writing everything out. Problem solved ... for your configs. Now you need to do the same for everything else your apps write: text documents, images, whatever. Almost every Unix process creates a file; that is the freaking basis of the whole Unix idea!

Clearly, this is not a viable path. So the answer remains: There is no solution on your side. Keep bothering Tso and the other FS developers until they fix their bugs.


My own answer was to keep the modifications in temp files and, after I had finished writing them all, execute fsync() on them and then rename them all.


You need to swap 3 & 4 in your last listing: fsync(fd) takes the file descriptor, so the file must still be open. And I don't see why that would be particularly costly: you want the data written to disk by the close() anyway, so the cost should be the same between what you want to happen and what will happen with fsync().

If the cost is too much, fdatasync(2) (where you have it) avoids syncing the metadata, so it should be cheaper.

EDIT: So I wrote some extremely hacky test code:

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

static void testBasic()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp.tmp", O_WRONLY | O_CREAT, 0644); /* O_CREAT needs a mode */
    write(fd,text,strlen(text));
    close(fd);
    rename("temp.tmp","temp");
}

static void testFsync()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp1", O_WRONLY | O_CREAT, 0644);
    write(fd,text,strlen(text));
    fsync(fd);
    close(fd);
    rename("temp1","temp");
}

static void testFdatasync()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp1", O_WRONLY | O_CREAT, 0644);
    write(fd,text,strlen(text));
    fdatasync(fd);
    close(fd);
    rename("temp1","temp");
}

#define ITERATIONS 10000

static void testLoop(int type)
{
    struct timeval before;
    struct timeval after;
    long seconds;
    long usec;
    int i;

    gettimeofday(&before,NULL);
    if (type == 1)
    {
        for (i = 0; i < ITERATIONS; i++)
        {
            testBasic();
        }
    }
    if (type == 2)
    {
        for (i = 0; i < ITERATIONS; i++)
        {
            testFsync();
        }
    }
    if (type == 3)
    {
        for (i = 0; i < ITERATIONS; i++)
        {
            testFdatasync();
        }
    }
    gettimeofday(&after,NULL);

    seconds = (long)(after.tv_sec - before.tv_sec);
    usec = (long)(after.tv_usec - before.tv_usec);
    if (usec < 0)
    {
        seconds--;
        usec += 1000000;
    }

    printf("%ld.%06ld\n",seconds,usec);
}

int main()
{
    testLoop(1);
    testLoop(2);
    testLoop(3);
    return 0;
}

On my laptop that produces:

0.595782
6.338329
6.116894

This suggests that fsync() is roughly 10 times more expensive, and that fdatasync() is slightly cheaper.

I guess the problem I see is that every application is going to think its data is important enough to fsync(), so the performance advantage of merging writes over a minute will be eliminated.
