fsync() on a different thread: apparently a useless trick

Monday, 03 May 10
fsync() is the kind of system call you love and hate at the same time. In many applications it's nice to know that kernel buffers are flushed to disk (even if this alone does not necessarily guarantee that data is actually written to the disk, as the disk itself can have caching layers), but unfortunately fsync() tends to be monkey-ass slow.

As I like numbers: slow is, for instance, 55 milliseconds against a small file with not so many writes, while the disk is idle. Slow means a few seconds when the disk is busy and there is a serious amount of data to flush.

With some applications this is not a problem. For instance, when you save your edited file in vim the worst that can happen is some delay before the editor quits. But there are applications where both speed and persistence guarantees are required, especially when we talk about databases.

Like in my specific case: Redis supports a persistence mode called Append Only File, where every change to the dataset is written to disk before reporting a success status code to the client performing the operation. In this kind of application it is desirable to fsync() in order to make sure the data is actually on disk in the event of a system crash or the like.

Since fsyncing is slow, Redis allows the user to select among three different fsync policies:
  • fsync never: just let the kernel do it when needed. In Linux this usually means that data will be flushed to disk within 30 seconds at most, but you can change the kernel settings to change this default if needed.
  • fsync everysec: call fsync every second.
  • fsync always: call fsync after every write operation against the Append Only File, and before reporting a success status code to the client.
The first option is the fastest, the second is almost as fast as the first but much safer, and the third is so slow as to be basically impossible to use, to the point that I'm thinking about dropping it.
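
As a rough illustration of how the three policies differ, here is a minimal sketch in C. The names (aof_policy, aof_should_fsync, aof_append) are made up for this example and are not the actual Redis code; the only real calls are write(2), fsync(2) and time(2).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical names for the three policies described above. */
typedef enum { AOF_FSYNC_NEVER, AOF_FSYNC_EVERYSEC, AOF_FSYNC_ALWAYS } aof_policy;

/* Decide whether an fsync() is due, given the current time and the time
 * of the last fsync (both in seconds). */
int aof_should_fsync(aof_policy policy, time_t now, time_t last_fsync) {
    switch (policy) {
    case AOF_FSYNC_ALWAYS:   return 1;                     /* after every write */
    case AOF_FSYNC_EVERYSEC: return now - last_fsync >= 1; /* at most once per second */
    case AOF_FSYNC_NEVER:
    default:                 return 0;                     /* leave it to the kernel */
    }
}

/* Append one entry and fsync if the policy says so.
 * Returns 0 on success, -1 on error. In this sketch the server would
 * report success to the client only after this returns 0. */
int aof_append(int fd, const void *buf, size_t len,
               aof_policy policy, time_t *last_fsync) {
    assert(buf != NULL && last_fsync != NULL);
    if (write(fd, buf, len) != (ssize_t)len) return -1; /* short write treated as error */
    time_t now = time(NULL);
    if (aof_should_fsync(policy, now, *last_fsync)) {
        if (fsync(fd) == -1) return -1;
        *last_fsync = now;
    }
    return 0;
}
```

With AOF_FSYNC_ALWAYS the fsync() sits on the reply path of every single write, which is where the cost described above comes from; with AOF_FSYNC_EVERYSEC it is paid by at most one write per second.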

The "fsync everysec" policy is a very good compromise and works well in practice if the disk is not too busy serving other processes. But in this mode we just need to sync every second, and the sync does not have to block the reply to the client, so an obvious thing to do is to move the fsync call into another thread. In theory, when an fsync from time to time takes too long because the disk is busy, no one will notice, and the latency from the point of view of the client talking with the Redis server will be as good as usual.

Sounds cool right? But I started to have the feeling that this would be totally useless, as the write(2) call would block anyway if there was a slow fsync() going on against the same file, so I wrote the following test program:



The program is pretty simple. It starts one thread doing an fsync() call every second, while the other (main) thread does a write 10 times per second. Both syscalls are timed in order to check whether, when a slow fsync() is in progress, the write() also blocks for the same time.
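
A program of this shape can be sketched as follows. This is an illustrative reconstruction, not the original listing: the function names, the sync_args struct, and the interval parameters are assumptions made for this example, and printing fsync()'s return value in parentheses is a guess at what the "(0)" in the output stands for.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Wall-clock time in microseconds. */
static long long microseconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

/* Sleep for the given number of microseconds. */
static void sleep_us(long us) {
    struct timespec ts = { .tv_sec = us / 1000000L,
                           .tv_nsec = (us % 1000000L) * 1000L };
    nanosleep(&ts, NULL);
}

/* Arguments for the syncing thread (hypothetical struct for this sketch). */
struct sync_args {
    int fd;
    long interval_us;  /* pause between fsync() calls */
    atomic_int stop;   /* set by the main thread to end the loop */
};

/* Thread body: fsync() the file periodically and report how long it took. */
static void *sync_thread(void *p) {
    struct sync_args *a = p;
    while (!atomic_load(&a->stop)) {
        sleep_us(a->interval_us);
        long long start = microseconds();
        int ret = fsync(a->fd);
        printf("Sync in %lld microseconds (%d)\n", microseconds() - start, ret);
    }
    return NULL;
}

/* Write `count` single bytes, pausing write_interval_us between them,
 * while a second thread fsyncs every sync_interval_us.
 * Returns the slowest write() in microseconds, or -1 on error. */
long long run_benchmark(const char *path, int count,
                        long write_interval_us, long sync_interval_us) {
    assert(path != NULL && count > 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) return -1;

    struct sync_args args = { .fd = fd, .interval_us = sync_interval_us };
    atomic_init(&args.stop, 0);
    pthread_t tid;
    if (pthread_create(&tid, NULL, sync_thread, &args) != 0) {
        close(fd);
        return -1;
    }

    long long worst = 0;
    for (int i = 0; i < count; i++) {
        long long start = microseconds();
        if (write(fd, "x", 1) != 1) { worst = -1; break; }
        long long elapsed = microseconds() - start;
        printf("Write in %lld microseconds\n", elapsed);
        if (elapsed > worst) worst = elapsed;
        sleep_us(write_interval_us);
    }

    atomic_store(&args.stop, 1);
    pthread_join(tid, NULL);
    close(fd);
    return worst;
}
```

Calling run_benchmark(path, 6000, 100000, 1000000) from main() gives the 10 writes per second and one fsync() per second described above, for about ten minutes. Build with something like cc -pthread.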

The output speaks for itself:
...
Write in 11 microseconds
Write in 12 microseconds
Write in 12 microseconds
Write in 12 microseconds
Sync in 40523 microseconds (0)
Write in 30596 microseconds
Write in 11 microseconds
Write in 11 microseconds
Write in 11 microseconds
Write in 11 microseconds
...
Unfortunately my suspicion is confirmed. This is really counterintuitive, since after all we are talking about flushing buffers to disk. When this operation starts the kernel could allocate new buffers to be used by new write(2) calls, so my guess is that this is a Linux limitation, not something that must be this way. (Note: you may need to run the program for a few minutes before the two threads are in sync enough to show this behavior.)

What about fdatasync()?

Since this behavior seemed so strange, I started wondering whether fsync() blocks all other writes until the buffers are flushed to disk because it is also required to flush metadata. So I tried the same thing with fdatasync(). Unfortunately it just takes more time to see the same behavior, because fdatasync() calls are usually much faster, but from time to time I was able to see it happening again:
Write in 13 microseconds
Write in 13 microseconds
Write in 14 microseconds
Write in 14 microseconds
Write in 12 microseconds
Write in 13 microseconds
Write in 12 microseconds
Sync in 48649 microseconds (0)
Write in 47213 microseconds
Write in 13 microseconds
Write in 10 microseconds
Write in 13 microseconds
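
Switching the experiment from fsync() to fdatasync() only changes the call made by the syncing thread. A minimal sketch of a timing helper that can do either; timed_flush and its data_only flag are names invented for this example, and fdatasync() is assumed to be declared in unistd.h as it is on Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* Flush `fd` with fdatasync() if data_only is non-zero, otherwise with
 * fsync(). Stores the call's return value in *ret and returns the elapsed
 * wall-clock time in microseconds. */
long long timed_flush(int fd, int data_only, int *ret) {
    struct timeval start, end;
    assert(ret != NULL);
    gettimeofday(&start, NULL);
    *ret = data_only ? fdatasync(fd) : fsync(fd);
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) * 1000000LL
         + (end.tv_usec - start.tv_usec);
}
```

The helper only measures the flush itself. Whether a concurrent write() stalls still has to be observed from the writing thread, as in the output above.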

Conclusions

If you have a write-intensive Linux application and are thinking about calling fsync() in another thread in order to avoid blocking, don't do it: it's completely useless with the current kernel implementation.

If you are a kernel hacker and know why Linux behaves in an apparently lame way about this, please let me know.

Edit: Same test as above, but instead of using the fsync()ing thread the file is opened with O_SYNC:
...
Write in 219 microseconds
Write in 253 microseconds
Write in 264 microseconds
Write in 271 microseconds
Write in 246 microseconds
Write in 259 microseconds
Write in 251 microseconds
Write in 232 microseconds
Write in 235 microseconds
Write in 251 microseconds
...


Very interesting. Every write takes more than 20 times longer, but blocking for about 2500 us in total every 10 writes is much better than the big stop-the-world-for-40000-us every 10 writes with fsync(). So we have a clear winner here for "fsync always". There is still no better solution than the current one for "fsync everysec", but this is working pretty well already.
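
For reference, the O_SYNC variant needs no syncing thread at all: the flag is passed to open(2), and each write(2) then returns only once the kernel considers the data written through to the device. A minimal sketch; open_aof_sync and append_sync are hypothetical names for this example, not Redis functions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Open (or create) an append-only file in synchronous mode.
 * Returns the file descriptor, or -1 on error. */
int open_aof_sync(const char *path) {
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
}

/* Append len bytes, retrying on short writes.
 * Returns 0 on success, -1 on error. */
int append_sync(int fd, const char *buf, size_t len) {
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0) return -1; /* error (or no progress): give up */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Since the cost is spread over every write instead of being paid in one large fsync(), the reply to the client can be sent as soon as append_sync() returns 0.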
Posted at 03:40:11

Comments

jani writes:
03 May 10, 04:44:13
Works fine for me, all write times are small and quite uniform.

Have you tried passing O_NONBLOCK to open() in the write thread?

Write in 11 microseconds
Write in 14 microseconds
Write in 14 microseconds
Write in 21 microseconds
Write in 15 microseconds
Write in 15 microseconds
Write in 15 microseconds
Write in 15 microseconds
Write in 15 microseconds
Write in 15 microseconds
Sync in 456 microseconds (0)
antirez writes:
03 May 10, 04:49:57
@jani: you may need to run this for some time, or to speed up the collision try a lower usleep() time in the fsync-ing thread.
antirez writes:
03 May 10, 04:51:49
@jani: about O_NONBLOCK, it should not be related: the operation is blocking but it is in another thread. The problem is that at kernel level Linux appears to lock the file for some reason while a costly flush is happening.

If you try to copy a big file while running the test you can notice it much more easily.
kosaki writes:
03 May 10, 04:58:11
Hi

Your write seems to change the file size, which means fdatasync
behaves like fsync. Can you please try sync_file_range or
fallocate?

FWIW, the exclusion between fsync and write is necessary to
avoid livelock. Please remember that fsync on some filesystems
(e.g. NFS) is very slow. If we allowed concurrent writes, fsync
might never end.

see linux/fs/sync.c
------------------------------
int vfs_fsync_range(struct file *file, struct dentry *dentry, loff_t start,
                    loff_t end, int datasync)
{
    (snip)
    /*
     * We need to protect against concurrent writers, which could cause
     * livelocks in fsync_buffers_list().
     */
    mutex_lock(&mapping->host->i_mutex);
    err = fop->fsync(file, dentry, datasync);
    if (!ret)
        ret = err;
    mutex_unlock(&mapping->host->i_mutex);
    (snip)
}

Disclaimer: I'm a Linux kernel developer, but I'm NOT a filesystem
developer, so perhaps I'm saying something completely wrong ;)
kosaki writes:
03 May 10, 05:06:18
And I'm guessing you are using the ext3 filesystem.
ext4 has great improvements in this area. Running
your test program on ext4 I got:

Write in 107 microseconds
Sync in 25051 microseconds (0)
Write in 63 microseconds
Write in 46 microseconds
Write in 33 microseconds
Write in 40 microseconds
Write in 139 microseconds
Write in 38 microseconds
Write in 39 microseconds
Write in 32 microseconds
Write in 39 microseconds
Write in 41 microseconds
Sync in 20932 microseconds (0)
Write in 79 microseconds
Write in 42 microseconds
Write in 53 microseconds
Write in 44 microseconds
Write in 35 microseconds
Write in 50 microseconds
Write in 35 microseconds
Write in 35 microseconds
Write in 42 microseconds
Write in 56 microseconds

Thanks.
antirez writes:
03 May 10, 05:09:06
@kosaki: thanks for your comment!

I can avoid syncing metadata without problems, but if I understand this correctly, in an append only file context, not syncing the size seems equivalent to losing data if a crash occurs, I guess... so this is not possible in my context.

fallocate: this seems interesting in order to speed up the AOF indeed. Thanks. Still probably not able to fix my problem.

About livelocks, I think that flushing the buffers as they are at the time fsync() is called is more than enough in many applications. It's OK that new data arriving after the call will be flushed by the next fsync(), I guess. So this is strange behavior from the point of view of least surprise, but maybe this is just the standard.

Still as I can see it is clearly stated in the source code. Strange indeed.

Thank you very much for the informative comment!
antirez writes:
03 May 10, 05:10:28
@kosaki: great news about ext4! I'll try to perform some benchmark with it. Thanks again.
antirez writes:
03 May 10, 05:24:05
Update: all my tests were done on ext4. I used to think it was an ext3 install but this is not the case.
jani writes:
03 May 10, 05:41:55
Indeed, I can reproduce it now if I copy a large file at the same time.

aio_write (even though it may do the same thing in the background) does not show the delays when copying the same big file.

#include <aio.h>
...
struct aiocb aio;
...

aio.aio_fildes = fd;
aio.aio_buf = "x";
aio.aio_nbytes = 1;
while(1) {
start = microseconds();
if (aio_write(&aio) == -1) {
Paul writes:
03 May 10, 05:53:02
Ran this on my OSX machine, here's the output:

Write in 33 microseconds
Sync in 559 microseconds (0)
Write in 51 microseconds
Write in 52 microseconds
Write in 52 microseconds
Write in 47 microseconds
Write in 52 microseconds
Write in 51 microseconds
Write in 49 microseconds
Write in 51 microseconds
Write in 49 microseconds
Sync in 781 microseconds (0)
Write in 39 microseconds
Write in 51 microseconds
Write in 50 microseconds
Write in 49 microseconds
Write in 54 microseconds
Write in 50 microseconds
Write in 50 microseconds
Write in 51 microseconds
Write in 51 microseconds
Write in 49 microseconds
Sync in 641 microseconds (0)
Write in 29 microseconds
Write in 51 microseconds

I know, darwin not linux, just wanted to show it off :p
antirez writes:
03 May 10, 05:56:32
@jani: thanks!

@Paul: yes, fsync() on Darwin is definitely much faster... this is why the lame behavior of "fsync always" went unnoticed for so long: I used to develop primarily on Mac OS X and test on Linux from time to time. Now VMware provided a great Linux box that I'm using as my primary development box.
Uriel Katz writes:
03 May 10, 07:19:09
Here is a simple solution: don't use write. Copy the data to some userland buffer, and then the thread will write it and fsync, so writing and fsync are done only on the IO thread. The thread that accepts connections only writes to memory, which is fast and non-blocking.
antirez writes:
03 May 10, 07:46:48
@Uriel: it's a solution indeed, but replying after write(2) or replying after just writing some data into a user-space buffer is pretty different from the point of view of durability.

For instance, with the current implementation, even with the "fsync never" policy it's possible to hack with /proc settings in order to get a decent compromise between flush times and performance. Also this only solves the problem with "fsync everysec", which is already working pretty well.

My current idea is that the best thing to do is using O_SYNC for "fsync always" and just use fdatasync() without a thread for "fsync everysec".

Thanks for your comment.
03 May 10, 08:46:26
Antirez, you should look up Richard Hipp's articles called "The Great fsync() Bug". He found the problem to be a bug in ext3 while working on sqlite.
Anon writes:
03 May 10, 09:21:52
fsync on OSX/Darwin doesn't necessarily wait for the data to be reported as safely written before returning. See http://shaver.off.net/diary/2008/05/25/fsyncers-an... (Why doesn’t other (non-sqlite) software do this?) for details.
03 May 10, 11:17:00
Here's what I see on xfs on the same filesystem as a MySQL/InnoDB instance that's active:

Write in 5 microseconds
Write in 5 microseconds
Write in 5 microseconds
Write in 5 microseconds
Write in 4 microseconds
Write in 6 microseconds
Write in 6 microseconds
Write in 5 microseconds
Write in 5 microseconds
Sync in 26 microseconds (0)
Write in 5 microseconds
Write in 5 microseconds
Write in 6 microseconds
Write in 5 microseconds
Write in 4 microseconds
Write in 5 microseconds
Write in 5 microseconds
Write in 5 microseconds
Write in 4 microseconds
Sync in 49 microseconds (0)
Write in 6 microseconds
Write in 5 microseconds
Write in 5 microseconds
Write in 5 microseconds

-- Jeremy
Vimal writes:
03 May 10, 11:37:25
Numbers on OpenBSD 4.6 (32 bit).. FFS, but mounted over NFS. Are calls to NFS nonblocking?

-bash-4.0$ ./fsynctest
Sync in 61 microseconds (0)
Write in 31 microseconds
Write in 9 microseconds
Write in 9 microseconds
Write in 9 microseconds
Write in 9 microseconds
Write in 9 microseconds
Write in 9 microseconds
Write in 8 microseconds
Write in 9 microseconds
Write in 7 microseconds
Sync in 56 microseconds (0)
Write in 10 microseconds
Write in 9 microseconds
Write in 8 microseconds
Write in 9 microseconds
Write in 8 microseconds
Write in 8 microseconds
Write in 9 microseconds
Write in 8 microseconds
Write in 8 microseconds
Sync in 49 microseconds (0)
Write in 9 microseconds
Write in 8 microseconds
Write in 9 microseconds
Write in 10 microseconds
Write in 10 microseconds
Write in 8 microseconds
Write in 8 microseconds
Write in 9 microseconds
Write in 9 microseconds
Sync in 63 microseconds (0)
Niraj Tolia writes:
03 May 10, 12:46:14
Hi Salvatore,

You state that fsync always "is so slow to be basically impossible to use, at the point I'm thinking about dropping it." I would definitely argue against doing this because a number of people do care about durability and they are aware of the cost of enabling this option. Silently losing data is not always an acceptable option.
antirez writes:
03 May 10, 13:02:20
@Niraj: now that we have a sane option (O_SYNC) there is no chance that this is going away. I was tempted to remove it because yesterday I measured 20 queries/second against a fast Linux box... this was unacceptable, but now with O_SYNC things are starting to work well again. Hacking on it right now... thanks for the comment.
notzed writes:
05 May 10, 00:49:15
I don't think you understand what fsync is for, which is pretty disturbing since this is for a database product.

If you don't care when it writes, you don't need to use it at all. And what I mean by that is that if you can't guarantee a write has been completed before doing other things, then that means you don't care when it writes.

Putting it in another thread won't matter, and duh, of course it will 'block' all writes to the same file, that's exactly what it should be doing.

If you do need it, some slowness is unavoidable, but what is 'slow' is somewhat relative.

I'd try a better filesystem though, extN is god-awful slow at anything like that.
antirez writes:
05 May 10, 09:29:23
@notzed: we have a mode in Redis-server for when you don't want the guarantee of write-before-ack, but are happier if the kernel buffers are flushed to disk at least once every second.

In this case it made sense to use another thread, but unfortunately write will block anyway.
05 May 10, 13:56:11
It'd be interesting to see a test using mmap to a sparsely allocated file and msync. I suspect it would behave the same, but worth checking.
12 May 10, 20:01:48
Antirez,

Regarding O_SYNC, be careful: 250 microseconds on a plain disk means something is not making it to the disk. This is just too fast. Speaking about the fully durable option, the people who will find it usable will be running it on RAID, and in that case (see Jeremy Zawodny's example) it is cheap. Perhaps the most important thing about fsync() is that its behavior on server-grade systems has nothing to do with its behavior on your standard laptop/desktop system.
Michael Herf writes:
15 May 10, 14:53:47
Have you tried using a new file after each fsync? Or round-robin between 'n' open files with serial numbers? Wondering if ext3 will handle this better (probably fragmentation is an issue though).
havana writes:
19 Jun 10, 03:40:07
about my system and gcc:

Linux gauss 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 07:54:58 UTC 2010 i686 GNU/Linux
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

See you
havana writes:
19 Jun 10, 03:44:32
Sorry, here the test:

Write in 39 microseconds
Sync in 2816 microseconds (0)
Write in 34 microseconds
Write in 26 microseconds
Write in 26 microseconds
Write in 26 microseconds
Write in 26 microseconds
Write in 27 microseconds
Write in 25 microseconds
Write in 25 microseconds
Write in 26 microseconds
Write in 26 microseconds
Sync in 1788 microseconds (0)
Write in 31 microseconds
kenn writes:
14 Jul 10, 04:02:54
I thought the point in running fsync() on a different thread was that the main thread would never be blocked for *read* requests for clean pages?
Prayer writes:
16 Aug 10, 03:59:40
Did you look at the O_DIRECT flag when opening the file?
It allows your application to bypass some of the operating
system's I/O caching. The downside is that you have to
implement the caching policy yourself.
22 Sep 10, 11:12:46
Have you looked at aio_fsync (http://www.opengroup.org/onlinepubs/009695399/func...)?

I also need a background flush command so I can ensure that the kernel doesn't fill the page cache with pending writes. I am in Java land so I haven't gotten around to testing it out.

It certainly seems like it would fit the bill in your case. You can withhold a response until the IO completes and continue to do other work in the meantime.
antirez writes:
24 Sep 10, 06:36:39
@Ariel: very cool! Thanks. I'll try how this works, but sounds very very promising.
24 Sep 10, 11:42:27
A follow up RE: fdatasync. I wrote this test case http://pastebin.com/BEGRXTDM

Which outputs:

Starting first writes at 1285341921
Finishing first writes with duration 18 at 1285341939
Starting first sync at 1285341939
Beginning second writes at 1285341939
Finished second writes with duration 22 at 1285341961
Finished first sync with duration 26 at 1285341965
Doing final sync 1285341965
Did final sync 1285341965

I think this shows that fdatasync (what FileChannel.force(false) uses) can interfere with writes but not completely block them. This doesn't contradict your results, but it shows that performing the fsync in a separate thread has some utility. It looks like the kernel is actively flushing new data as it comes in while an fsync call is pending, which seems to contradict the idea that a livelock avoidance mechanism exists that completely blocks writes during an fsync. I did quite a bit of Googling and was not clear on that point, hence the test case.

One approach for the redis log case might be to use aio_write + a dedicated fsync thread. Or you could go all aio and stay single threaded.

I think that the suggestion to use a separate file for each second of commit log is also a good approach. This would ensure that the fsync never interferes with writes to the active file. This is what Cassandra does for batch commits. It has the advantage of allowing truncation via deletion.

I have been meaning to find out how Redis compacts its log.
27 Sep 10, 10:48:11
Sorry, pastebin link broke http://aweisberg.pastebin.com/VXkQx8sa