Comments for post fsync() on a different thread: apparently a useless trick

Ariel Weisberg writes: Sorry, pastebin link broke http://aweisberg.pastebin.com/VXkQx8sa

Ariel Weisberg writes: A follow-up regarding fdatasync. I wrote this test case: http://pastebin.com/BEGRXTDM

It outputs:

    Starting first writes at 1285341921
    Finishing first writes with duration 18 at 1285341939
    Starting first sync at 1285341939
    Beginning second writes at 1285341939
    Finished second writes with duration 22 at 1285341961
    Finished first sync with duration 26 at 1285341965
    Doing final sync 1285341965
    Did final sync 1285341965

I think this shows that fdatasync (what FileChannel.force(false) uses) can interfere with writes but not completely block them. This doesn't contradict your results, but it shows that performing the fsync in a separate thread has some utility. It looks like the kernel is actively flushing new data as it comes in while an fsync call is pending, which seems to contradict the idea that a livelock avoidance mechanism exists that completely blocks writes during an fsync. I did quite a bit of Googling and was not clear on that point, hence the test case.

One approach for the Redis log case might be to use aio_write plus a dedicated fsync thread. Or you could go all aio and stay single threaded. I think that the suggestion to use a separate file for each second of commit log is also a good approach. This would ensure that the fsync never interferes with writes to the active file. This is what Cassandra does for batch commits. It has the advantage of allowing truncation via deletion. I have been meaning to find out how Redis compacts its log.
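
The separate-file-per-interval idea can be sketched roughly as follows. This is only an illustration, not how Cassandra or Redis implement it: the `seglog` struct, the function names and the `commit-N.log` naming are all hypothetical. The point it shows is that the descriptor being synced is never the one receiving new writes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical segmented commit log; every name here is illustrative. */
struct seglog {
    char dir[200];  /* directory holding the segments */
    int  fd;        /* descriptor of the active segment */
    long seq;       /* sequence number of the active segment */
};

/* Open (or create) the segment file for the current sequence number. */
static int seglog_open_segment(struct seglog *l) {
    char path[256];
    snprintf(path, sizeof(path), "%s/commit-%ld.log", l->dir, l->seq);
    l->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    return l->fd == -1 ? -1 : 0;
}

int seglog_init(struct seglog *l, const char *dir) {
    snprintf(l->dir, sizeof(l->dir), "%s", dir);
    l->seq = 0;
    return seglog_open_segment(l);
}

/* Switch writers to a fresh segment. Returns the descriptor of the
 * previous segment (to be handed to whoever does the syncing), or -1
 * if the new segment could not be opened, in which case the old
 * segment stays active. */
int seglog_rotate(struct seglog *l) {
    int old = l->fd;
    l->seq++;
    if (seglog_open_segment(l) == -1) {
        l->fd = old;
        l->seq--;
        return -1;
    }
    return old;
}

/* Flush and close a retired segment. Nobody writes to it any more,
 * so this sync does not compete with writes to the same file. */
int seglog_retire(int old_fd) {
    int rc = fdatasync(old_fd);
    if (close(old_fd) == -1) rc = -1;
    return rc;
}
```

A caller would invoke `seglog_rotate()` once per interval from the writing thread and pass the returned descriptor to a background thread that calls `seglog_retire()`. Old segments can later be dropped with `unlink()`, which is the "truncation via deletion" mentioned above.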

antirez writes: @Ariel: very cool! Thanks. I'll try it to see how this works, but it sounds very, very promising.

Ariel Weisberg writes: Have you looked at aio_fsync (http://www.opengroup.org/onlinepubs/009695399/functions/aio_fsync.html)? I also need a background flush command so that I can ensure the kernel doesn't fill the page cache with pending writes. I am in Java land, so I haven't gotten around to testing it out. It certainly seems like it would fit the bill in your case. You can withhold a response until the IO completes and continue to do other work in the meantime.

Prayer writes: Did you look at the O_DIRECT flag when opening the file? It allows your application to bypass some of the operating system's I/O caching. The trade-off is that you have to implement the caching policy yourself.

kenn writes: I thought the point in running fsync() on a different thread was that the main thread would never be blocked for *read* requests for clean pages?

havana writes: Sorry, here is the test:

    Write in 39 microseconds
    Sync in 2816 microseconds (0)
    Write in 34 microseconds
    Write in 26 microseconds
    Write in 26 microseconds
    Write in 26 microseconds
    Write in 26 microseconds
    Write in 27 microseconds
    Write in 25 microseconds
    Write in 25 microseconds
    Write in 26 microseconds
    Write in 26 microseconds
    Sync in 1788 microseconds (0)
    Write in 31 microseconds

havana writes: About my system and gcc:

    Linux gauss 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 07:54:58 UTC 2010 i686 GNU/Linux
    gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

See you

Michael Herf writes: Have you tried using a new file after each fsync? Or round-robin between 'n' open files with serial numbers? Wondering if ext3 will handle this better (probably fragmentation is an issue though).

Peter Zaitsev writes: Antirez, regarding O_SYNC, be careful: 250 microseconds on a plain disk means something is not making it to the disk. This is just too fast. As for the fully durable option, the people who will find it usable will be running it on RAID, and in that case it is cheap (see Jeremy Zawodny's example). Perhaps the most important thing about fsync() is that its behavior on server-grade systems has nothing to do with its behavior on your standard laptop/desktop system.

Jason Watkins writes: It'd be interesting to see a test using mmap to a sparsely allocated file and msync. I suspect it would behave the same, but worth checking.
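
A rough sketch of what such an mmap/msync test could look like. The function name and parameters are hypothetical, and the sketch only shows the mechanics (sparse file via ftruncate, a shared mapping, MS_SYNC), without timing anything or claiming how it compares with write plus fsync.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy msg into a shared mapping of a sparse file and flush it with
 * msync. Returns 0 on success, -1 on error. */
int mmap_write_and_sync(const char *path, size_t map_len, const char *msg) {
    size_t len = strlen(msg);
    if (len > map_len) return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    /* Extending with ftruncate sets the size without writing data, so
     * the file is typically sparse until pages are touched. */
    if (ftruncate(fd, (off_t)map_len) == -1) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p, msg, len);                  /* the "write" */
    int rc = msync(p, map_len, MS_SYNC);  /* the "fsync" */

    munmap(p, map_len);
    close(fd);
    return rc;
}
```

For example, `mmap_write_and_sync("/tmp/msync_test.tmp", 1 << 20, "hello")` maps one megabyte and flushes it after touching only the first page.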

antirez writes: @notzed: we have a mode in Redis-server for when you don't want write-before-ack guarantees, but are happy as long as the kernel buffers are flushed to disk at least once every second. In this case it made sense to use another thread, but unfortunately write will block anyway.

notzed writes: I don't think you understand what fsync is for, which is pretty disturbing since this is for a database product. If you don't care when it writes, you don't need to use it at all. And what I mean by that is that if you can't guarantee a write has been completed before doing other things, then you don't care when it writes. Putting it in another thread won't matter, and duh, of course it will 'block' all writes to the same file; that's exactly what it should be doing. If you do need it, some slowness is unavoidable, but what is 'slow' is somewhat relative. I'd try a better filesystem though; extN is god-awful slow at anything like that.

antirez writes: @Niraj: now that we have a sane option (O_SYNC) there is indeed no chance that this is going away. I was almost tempted to remove it, as yesterday I measured 20 queries/second against a fast Linux box... this was unacceptable, but now with O_SYNC things are starting to work well again. Hacking on it right now... thanks for the comment

Niraj Tolia writes: Hi Salvatore, You state that fsync always "is so slow to be basically impossible to use, at the point I'm thinking about dropping it." I would definitely argue against doing this because a number of people do care about durability and they are aware of the cost of enabling this option. Silently losing data is not always an acceptable option.

Vimal writes: Numbers on OpenBSD 4.6 (32 bit).. FFS, but mounted over NFS. Are calls to NFS nonblocking?

    -bash-4.0$ ./fsynctest
    Sync in 61 microseconds (0)
    Write in 31 microseconds
    Write in 9 microseconds
    Write in 9 microseconds
    Write in 9 microseconds
    Write in 9 microseconds
    Write in 9 microseconds
    Write in 9 microseconds
    Write in 8 microseconds
    Write in 9 microseconds
    Write in 7 microseconds
    Sync in 56 microseconds (0)
    Write in 10 microseconds
    Write in 9 microseconds
    Write in 8 microseconds
    Write in 9 microseconds
    Write in 8 microseconds
    Write in 8 microseconds
    Write in 9 microseconds
    Write in 8 microseconds
    Write in 8 microseconds
    Sync in 49 microseconds (0)
    Write in 9 microseconds
    Write in 8 microseconds
    Write in 9 microseconds
    Write in 10 microseconds
    Write in 10 microseconds
    Write in 8 microseconds
    Write in 8 microseconds
    Write in 9 microseconds
    Write in 9 microseconds
    Sync in 63 microseconds (0)

Jeremy Zawodny writes: Here's what I see on xfs on the same filesystem as a MySQL/InnoDB instance that's active:

    Write in 5 microseconds
    Write in 5 microseconds
    Write in 5 microseconds
    Write in 5 microseconds
    Write in 4 microseconds
    Write in 6 microseconds
    Write in 6 microseconds
    Write in 5 microseconds
    Write in 5 microseconds
    Sync in 26 microseconds (0)
    Write in 5 microseconds
    Write in 5 microseconds
    Write in 6 microseconds
    Write in 5 microseconds
    Write in 4 microseconds
    Write in 5 microseconds
    Write in 5 microseconds
    Write in 5 microseconds
    Write in 4 microseconds
    Sync in 49 microseconds (0)
    Write in 6 microseconds
    Write in 5 microseconds
    Write in 5 microseconds
    Write in 5 microseconds

-- Jeremy

Anon writes: fsync on OSX/Darwin doesn't necessarily wait for the data to be reported as safely written before returning. See http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ (Why doesn’t other (non-sqlite) software do this?) for details.
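
On Darwin the stronger guarantee is requested through fcntl with F_FULLFSYNC, as documented in the macOS fsync(2) manual page. A small portable wrapper might look like the sketch below; the name `durable_sync` is made up for illustration, and the fallback to plain fsync when F_FULLFSYNC fails is one possible policy, not the only one.

```c
#include <fcntl.h>
#include <unistd.h>

/* Ask for the strongest flush the platform offers.
 * Returns 0 on success, -1 on error. */
int durable_sync(int fd) {
#ifdef F_FULLFSYNC
    /* Darwin: ask the drive to flush its own write cache as well. */
    if (fcntl(fd, F_FULLFSYNC) != -1) return 0;
    /* Not every filesystem supports it; fall back to fsync below. */
#endif
    return fsync(fd);
}
```

On systems where F_FULLFSYNC is not defined, the wrapper compiles down to a plain fsync() call.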

Baron Schwartz writes: Antirez, you should look up Richard Hipp's articles called "The Great fsync() Bug". He found the problem to be a bug in ext3 while working on sqlite.

antirez writes: @Uriel: it's a solution indeed, but replying after write(2) and replying after just writing some data into a user-space buffer are pretty different from the point of view of durability. For instance, with the current implementation, even with the "fsync never" policy it's possible to hack with /proc settings in order to get decent compromises between flush times and performance. Also, this only solves the problem with "fsync everysec", which is already working pretty well. My current idea is that the best thing to do is to use O_SYNC for "fsync always" and just use fdatasync() without a thread for "fsync everysec". Thanks for your comment.
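
The plan in the last sentences could be sketched along these lines. This is not Redis source code: the `aof` struct, the enum and the function names are hypothetical, and the once-per-second check is simplified to a change in the wall-clock second.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical policy names mirroring the three modes discussed. */
enum fsync_policy { FSYNC_NEVER, FSYNC_EVERYSEC, FSYNC_ALWAYS };

struct aof {
    int fd;
    enum fsync_policy policy;
    time_t last_sync;  /* second in which fdatasync last ran */
};

/* "fsync always" is implemented by opening the file with O_SYNC, so
 * every write(2) returns only after the data has been pushed out. */
int aof_open(struct aof *a, const char *path, enum fsync_policy policy) {
    int flags = O_WRONLY | O_CREAT | O_APPEND;
    if (policy == FSYNC_ALWAYS) flags |= O_SYNC;
    a->fd = open(path, flags, 0644);
    a->policy = policy;
    a->last_sync = time(NULL);
    return a->fd == -1 ? -1 : 0;
}

/* Append a record. With FSYNC_EVERYSEC, fdatasync runs inline (no
 * thread) the first time a write lands in a new second. */
ssize_t aof_append(struct aof *a, const void *buf, size_t len) {
    ssize_t n = write(a->fd, buf, len);
    if (n == -1) return -1;
    if (a->policy == FSYNC_EVERYSEC) {
        time_t now = time(NULL);
        if (now != a->last_sync) {
            if (fdatasync(a->fd) == -1) return -1;
            a->last_sync = now;
        }
    }
    return n;
}
```

With FSYNC_NEVER the sketch simply leaves flushing to the kernel, which is where the /proc tuning mentioned above would apply.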

Uriel Katz writes: Here is a simple solution: don't use write. Copy the data to some userland buffer and let the thread write it and fsync, so that writing and fsync are done only on the IO thread. The thread that accepts connections only writes to memory, which is fast and non-blocking.

antirez writes: @jani: thanks! @Paul: yes, fsync() on Darwin is definitely much faster... this is why the lame behavior of "fsync always" went unnoticed for so long: I used to develop primarily on Mac OS X and test on Linux from time to time. Now VMware has provided a great Linux box that I'm using as my primary development box.

Paul writes: Ran this on my OSX machine, here's the output:

    Write in 33 microseconds
    Sync in 559 microseconds (0)
    Write in 51 microseconds
    Write in 52 microseconds
    Write in 52 microseconds
    Write in 47 microseconds
    Write in 52 microseconds
    Write in 51 microseconds
    Write in 49 microseconds
    Write in 51 microseconds
    Write in 49 microseconds
    Sync in 781 microseconds (0)
    Write in 39 microseconds
    Write in 51 microseconds
    Write in 50 microseconds
    Write in 49 microseconds
    Write in 54 microseconds
    Write in 50 microseconds
    Write in 50 microseconds
    Write in 51 microseconds
    Write in 51 microseconds
    Write in 49 microseconds
    Sync in 641 microseconds (0)
    Write in 29 microseconds
    Write in 51 microseconds

I know, darwin not linux, just wanted to show it off :p

jani writes: Indeed, I can reproduce it now if I copy a large file at the same time. aio_write (even though it may do the same thing in the background) does not show the delays while copying the same big file:

    #include <aio.h>
    ...
    struct aiocb aio;
    ...
    aio.aio_fildes = fd;
    aio.aio_buf = "x";
    aio.aio_nbytes = 1;
    while(1) {
        start = microseconds();
        if (aio_write(&aio) == -1) {

antirez writes: Update: all my tests were done on ext4. I thought it was an ext3 install, but this is not the case.

antirez writes: @kosaki: great news about ext4! I'll try to perform some benchmark with it. Thanks again.

antirez writes: @kosaki: thanks for your comment! I can avoid syncing metadata without problems, but if I understand this correctly, in an append-only file context not syncing the size seems equivalent to losing data if a crash occurs, I guess... so this is not possible in my context. fallocate: this does seem interesting as a way to speed up the AOF. Thanks. It is still probably not able to fix my problem. About livelocks, I think that flushing the buffers as they are at the time fsync() is called is more than enough in many applications. I guess in many applications it's OK for new data arriving after the call to be flushed in the next fsync(). So this is strange behavior from the point of view of least surprise, but maybe this is just the standard. Still, as I can see, it is clearly stated in the source code. Strange indeed. Thank you very much for the informative comment!

kosaki writes: And I'm guessing you are now using the ext3 filesystem. ext4 has great improvements in this area. Running your test program on ext4, I got:

    Write in 107 microseconds
    Sync in 25051 microseconds (0)
    Write in 63 microseconds
    Write in 46 microseconds
    Write in 33 microseconds
    Write in 40 microseconds
    Write in 139 microseconds
    Write in 38 microseconds
    Write in 39 microseconds
    Write in 32 microseconds
    Write in 39 microseconds
    Write in 41 microseconds
    Sync in 20932 microseconds (0)
    Write in 79 microseconds
    Write in 42 microseconds
    Write in 53 microseconds
    Write in 44 microseconds
    Write in 35 microseconds
    Write in 50 microseconds
    Write in 35 microseconds
    Write in 35 microseconds
    Write in 42 microseconds
    Write in 56 microseconds

Thanks.

kosaki writes: Hi. Your write seems to change the file size, which means fdatasync behaves like fsync. Can you please try sync_file_range or fallocate?

FWIW, the exclusion between fsync and write is necessary to avoid livelock. Please remember that fsync on some filesystems (e.g. NFS) is very slow; if we allowed concurrent writes, fsync might never end. See linux/fs/sync.c:

    int vfs_fsync_range(struct file *file, struct dentry *dentry, loff_t start,
                        loff_t end, int datasync)
    {
        (snip)
        /*
         * We need to protect against concurrent writers, which could cause
         * livelocks in fsync_buffers_list().
         */
        mutex_lock(&mapping->host->i_mutex);
        err = fop->fsync(file, dentry, datasync);
        if (!ret)
            ret = err;
        mutex_unlock(&mapping->host->i_mutex);
        (snip)
    }

Disclaimer: I'm a Linux kernel developer, but I'm NOT a filesystem developer, so perhaps I'm saying something completely wrong ;)
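
The fallocate suggestion could be tried with something like the sketch below, which uses the portable posix_fallocate. The `prealloc_log` struct and function names are hypothetical. The idea is that once the space is reserved, appends no longer change the file size, so fdatasync has less metadata to flush; whether that removes the stall discussed in this thread is not something the sketch demonstrates. A reader of such a file also needs its own way to find the end of the valid data, since the preallocated tail reads back as zeros.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical log living inside a preallocated file. */
struct prealloc_log {
    int   fd;
    off_t end;       /* offset of the next append */
    off_t capacity;  /* bytes reserved up front */
};

/* Create the file and reserve capacity bytes so that later writes
 * do not extend it. Returns 0 on success, -1 on error. */
int plog_open(struct prealloc_log *l, const char *path, off_t capacity) {
    l->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (l->fd == -1) return -1;
    /* posix_fallocate returns an error number, not -1/errno. */
    if (posix_fallocate(l->fd, 0, capacity) != 0 || fsync(l->fd) == -1) {
        close(l->fd);
        return -1;
    }
    l->end = 0;
    l->capacity = capacity;
    return 0;
}

/* Write at the tracked offset, then flush data only.
 * Returns bytes written, or -1 on error or when out of space. */
ssize_t plog_append(struct prealloc_log *l, const void *buf, size_t len) {
    if ((off_t)len > l->capacity - l->end) return -1;
    ssize_t n = pwrite(l->fd, buf, len, l->end);
    if (n == -1) return -1;
    l->end += n;
    if (fdatasync(l->fd) == -1) return -1;
    return n;
}
```

The single fsync in `plog_open` is there to make the reserved size itself durable before any appends rely on it.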

antirez writes: @jani: about O_NONBLOCK, it should not be related: the operation is blocking, but it is in another thread. The problem is that, at kernel level, Linux appears to lock the file for some reason while a costly flush is happening. If you try to copy a big file while running the test, you can notice it much more easily.
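
For readers without the original test program at hand, a test of this shape could look roughly like the sketch below. It is a reconstruction for illustration only, not the program from the post: the function names, the sleep intervals and the bounded loop counts are assumptions, and only the output format follows the "Write in ..." and "Sync in ..." lines posted in this thread.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static long long microseconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

struct sync_args {
    int fd;
    int rounds;             /* how many fsync calls to issue */
    unsigned int pause_us;  /* sleep before each fsync */
};

/* Second thread: periodically fsync the file and time each call. */
static void *sync_thread(void *p) {
    struct sync_args *a = p;
    for (int i = 0; i < a->rounds; i++) {
        usleep(a->pause_us);
        long long start = microseconds();
        int rc = fsync(a->fd);
        printf("Sync in %lld microseconds (%d)\n",
               microseconds() - start, rc);
    }
    return NULL;
}

/* Main thread: issue small writes and time each one while the other
 * thread syncs the same descriptor. Returns the slowest write in
 * microseconds, or -1 on error. */
long long run_fsync_test(const char *path, int writes, int syncs) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    struct sync_args a = { fd, syncs, 10000 };
    pthread_t t;
    if (pthread_create(&t, NULL, sync_thread, &a) != 0) {
        close(fd);
        return -1;
    }

    long long worst = 0;
    for (int i = 0; i < writes; i++) {
        long long start = microseconds();
        if (write(fd, "x", 1) == -1) {
            worst = -1;
            break;
        }
        long long elapsed = microseconds() - start;
        printf("Write in %lld microseconds\n", elapsed);
        if (elapsed > worst) worst = elapsed;
        usleep(1000);
    }

    pthread_join(t, NULL);
    close(fd);
    return worst;
}
```

A write that lands while the other thread is inside fsync() would show up as an outlier in the "Write in" lines. Copying a big file on the same filesystem during the run makes each flush more expensive and the effect easier to spot.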

antirez writes: @jani: you may need to run this for some time, or, to speed up the collision, try a lower usleep() time in the fsync-ing thread.

jani writes: Works fine for me, all write times are small and quite uniform. Have you tried passing O_NONBLOCK to open() in the write thread?

    Write in 11 microseconds
    Write in 14 microseconds
    Write in 14 microseconds
    Write in 21 microseconds
    Write in 15 microseconds
    Write in 15 microseconds
    Write in 15 microseconds
    Write in 15 microseconds
    Write in 15 microseconds
    Write in 15 microseconds
    Sync in 456 microseconds (0)
