Fix macOS failing to flush blockfiles to disk for certain external drives

john-moffett commented at 5:10 pm on November 19, 2022: contributor

This PR attempts to close (or at least help) #26455 (and closed duplicates #26400 #26530).

Context

macOS’s fsync only flushes writes to the drive, but doesn’t ask the drive to immediately commit the data to permanent storage (ie - physically write it) (manpage). The data will eventually be physically written, but there are no guarantees when it will happen – it could be several seconds or more. For a stronger ‘immediate-write’ guarantee, they offer an F_FULLFSYNC option to fcntl (manpage) to accomplish that. This fsync-on-macOS behavior has caused historical problems in this project for macOS builds, so, since then, the F_FULLSYNC approach has been used for writing (see #3000).

However, certain external drives or filesystems do not support fcntl with F_FULLFSYNC option. (And some that do supposedly support the option do not actually do it.) These drives/filesystems will typically report errno ENOTSUP or ENOTTY, which is what was reported in the referenced issues. However, there is no reliable way to determine lack of support, so relying on specific error codes is insufficient.

This bug seems to only be affecting macOS 13 (Ventura) systems, but in theory, any macOS version could be affected.

What other projects have done

Other projects, including LevelDB, SQLite, and MariaDB have experienced similar issues, and have taken the strategy of falling back on the (less reliable) fsync when F_FULLFSYNC isn’t supported. (See helpful Twitter thread.) Some (SQLite, MariaDB, LevelDB) fall back to fsync on any error, and others only fall back on specific errors (BerkeleyDB 4.8 falls back only on ENOTSUP, which is why this PR may not fully fix #26455, as the wallet is getting the error ENOTTY.)

This PR

I propose to follow that general strategy and fall back on fsync if F_FULLFSYNC fails for any reason, while preserving both error codes for debugging.

(I think we should also print the error strings using SysErrorString rather than print the raw platform-dependent error codes, but I wanted to keep this PR as atomic as possible and save those changes (4 or 5 lines in addition to these) for another PR (which I’d be happy to submit – or modify this PR to include).)

Fix macOS failing to flush blockfiles to disk

Certain external drives or filesystems do not support fcntl
with F_FULLFSYNC option. Rather than fail, try regular fsync
instead. Other projects take this approach including LevelDB
and SQLite.

e663ee9b44

DrahtBot commented at 5:10 pm on November 19, 2022: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	jamesob

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

luke-jr changes_requested

luke-jr commented at 9:41 pm on November 19, 2022: member

Since fsync doesn’t do the whole job, it shouldn’t silence the error/return-false when F_FULLFSYNC fails.

And if, as you say, some drives claim to have done it while they actually haven’t, should we perhaps consider falling back to fsync even if F_FULLFSYNC succeeds? (Or maybe F_FULLFSYNC always at least does the equivalent of fsync? In which case I ask: does it do so even if F_FULLFSYNC returns an error?)

john-moffett commented at 10:09 pm on November 19, 2022: contributor

Since fsync doesn’t do the whole job, it shouldn’t silence the error/return-false when F_FULLFSYNC fails.

Sorry, I was unclear. (I’ll edit my comment.) fsync does eventually physically write to disk on macOS, but it just does it on its own time. It’s more vulnerable to corruption / data loss if, for example, the machine or drive loses power. The F_FULLFSYNC option is better just because it actually physically writes immediately after it flushes to disk. Prior to #3000, the macOS version just used fsync. This PR goes back to using fsync if the (better) F_FULLFSYNC isn’t available.

And if, as you say, some drives claim to have done it while they actually haven’t, should we perhaps consider falling back to fsync even if F_FULLFSYNC succeeds?

Some drives claim to write immediately with the fcntl F_FULLFSYNC call, but in reality only do the equivalent of macOS’s version of fsync – that is, they just flush to disk, return ASAP, and let the disk actually physically write whenever it wants, rather than blocking until the physical write is complete.

Or maybe F_FULLFSYNC always at least does the equivalent of fsync?

I think that’s what happens for the drives that ’lie’, which is what I just talked about. However, I don’t think any actual flushes or physical writes happen if F_FULLFYNC returns an error.

jamesob approved

jamesob commented at 2:20 pm on November 28, 2022: member

ACK e663ee9b448c2aad7449666b07d7fb87ea492d31 (jamesob/ackr/26534.1.john-moffett.fix_macos_failing_to_flu)

Thanks for the great PR and welcome to the project!

The patch is compact and well-described. Conceptually it’s pretty simple: in a nutshell this just calls fsync() should we fail the more specific fcntl() call. There is good precedent for this approach; bitcoin already relies on this strategy in our vendored leveldb code.

I’ve built and reveiewed the change locally.

 0-----BEGIN PGP SIGNED MESSAGE-----
 1Hash: SHA512
 2
 3ACK e663ee9b448c2aad7449666b07d7fb87ea492d31 ([`jamesob/ackr/26534.1.john-moffett.fix_macos_failing_to_flu`](https://github.com/jamesob/bitcoin/tree/ackr/26534.1.john-moffett.fix_macos_failing_to_flu))
 4
 5Thanks for the great PR and welcome to the project! 
 6
 7The patch is compact and well-described. Conceptually it's pretty simple: in a
 8nutshell this just calls `fsync()` should we fail the more specific `fcntl()`
 9call. There is good precedent for this approach; bitcoin [already relies on this
10strategy](https://github.com/jamesob/bitcoin/blob/e663ee9b448c2aad7449666b07d7fb87ea492d31/src/leveldb/util/env_posix.cc#L381-L394)
11in our vendored leveldb code.
12
13I've built and reveiewed the change locally.
14-----BEGIN PGP SIGNATURE-----
15
16iQIzBAEBCgAdFiEEGNRVI1NPYuZCSIrGepNdrbLETwUFAmOEw90ACgkQepNdrbLE
17TwUNBQ//QTTTV7TKsRbXSjXw1efU7jP/srJHtJoDNfTj8ZV4vbqfkrERGITi+Aud
18DAi5RhdU9PyO5hX5Z19bDuGoGxg3CHFNE6gKH2NXwdxsO7E1yJ44N+QQlvId9cJB
19+52xm+rXQeP4/aHHwmo6JuFmQCJC6Hsri7SYNbtJJWSU+qg+HSjAemtj4HKemZk3
20vupd4uPThcsrtHjen6ksW5EofXOikATl86BCYA/X0HC4mG7vQSbR5co1S0bVc2y7
212RxVOeAe0TI8HvBKiOzCT2fBRVmF/OL1vE3Uks4X8ObZLv3D3eLF7Lz2Kmu57n18
22sRKgE/o0PtlxwaTW4Sz19WtN3D3lOVE3lx1W/+vRi3RNHTSf4PqtvxZO02S1Kq9I
23IhhhekQIvrDsOrgaydyyTjTgxT3CRYiZI7uDTEu8xPqSo6GTNG9WpFC2olOWGLna
241vlOA/edBeZ0WCagEHGtouxnicvKUYzVK56uIpbJA2pnbnbnyempaEBs592LBxh+
25O/WJPsTqPGVZa0YhaMYk2TCVBA+XBAQsdkRYlkHLZw8OJsa5E+6k1TVZ1aXAFGOm
26BJ1AI6LOgMeeR9qZrdjHoroR9xFyC4kS4h+/BmKUSM6mhLaYoOoNaenCOlXK7RGV
27TcVocuW1tJrC4bgiC0ParDeFas5s0SsaUMrJJV5I+788QLcco4A=
28=JarI
29-----END PGP SIGNATURE-----

0Tested on Linux-6.0.2-arch1-1-x86_64-with-glibc2.36
1
2Configured with ./configure --without-bdb 'CXXFLAGS=-fPIE -pipe -O2 -g -Wthread-safety-analysis -Wall -Werror=sign-compare ' --disable-gui-tests --disable-fuzz-binary --enable-wallet --enable-debug --with-daemon --enable-natpmp-default CXX=/usr/bin/clang++ CC=/usr/bin/clang PYTHONPATH= --disable-shared --with-pic --enable-benchmark=no --enable-module-recovery --enable-module-schnorrsig --disable-shared --with-pic --enable-benchmark=no --enable-module-recovery --enable-module-schnorrsig --no-create --no-recursion
3
4Compiled with /usr/bin/ccache /usr/bin/clang++ -std=c++17 -mavx -mavx2 -mpclmul -fPIE -pipe -O2 -g -Wthread-safety-analysis -Wall -Werror=sign-compare  -O0 -g3 -ftrapv -fdebug-prefix-map=$(abs_top_srcdir)=.  -Wstack-protector -fstack-protector-all -fcf-protection=full -fstack-clash-protection -msse4.1 -msse4.2 -msse4 -msha  i
5
6Compiler version: clang version 14.0.6

jamesob commented at 2:22 pm on November 28, 2022: member

Also, the single failing CI job just needs a kick but apparently I don’t have perms to rerun it from the Cirrus CI page.

hebasto commented at 2:29 pm on November 28, 2022: member

Also, the single failing CI job just needs a kick but apparently I don’t have perms to rerun it from the Cirrus CI page.

Kicked.

luke-jr commented at 10:20 pm on November 28, 2022: member

fsync does eventually physically write to disk on macOS, but it just does it on its own time. It’s more vulnerable to corruption / data loss if, for example, the machine or drive loses power.

All writes are by default “on their own time”. The entire purpose of fsync is to force it to be immediate and block on that, so that after the call returns, there is no risk of corruption / data loss from something as simple as a power failure.

So if that isn’t happening, this function has failed, and should return an error. (If we don’t care, it should be ignored or otherwise handled the same for all failure cases, not just some.)

john-moffett commented at 0:14 am on November 29, 2022: contributor

@jamesob , thank you for your kind words and the time you took to review. @luke-jr Thanks for the comment, Luke. I believe I understand your point – thatFileCommit exists to assure that the data is actually physically written by the time the function returns, so we shouldn’t let macOS get away with breaking that contract just because the hardware doesn’t support it. It’s better to simply fail (or, alternatively, allow other systems to get away with the same behavior).

My thinking was that doing macOS’s poor fsync was better than leaving the hardware unsupported. To be fair, it does do something in this case – it flushes the buffer to the drive. However, it doesn’t provide the full guarantee implied by the comments. My assumption was that it was more of a ‘best-effort’ situation.

I’ll obviously defer to your judgment in situations like these, so I’m happy to close this if my understanding of your comment was correct.

luke-jr commented at 0:45 am on November 29, 2022: member

I agree we should do something. If FULLFSYNC fails, calling fsync is reasonable.

If we are unable to flush, ideally we should inform the user (perhaps make it a GUI-visible warning?), and let them choose another storage device.

OTOH, for block files, we can always reprocess the block again if needed, so that might actually be a better solution if we can’t flush properly.

luke-jr commented at 3:48 am on November 29, 2022: member

Looking at this again, it seems fflush failure indicates the write actually failed, and in that case we should terminate the program entirely.

If any of the others fail, I think we should post a warning to the user (maybe just before/during IBD?) and continue.

And in the meantime (out of scope here IMO), we shouldn’t be flushing block files at all. It’s an expensive operation, and usually unnecessary. Instead, we should just ensure we can recover cleanly from missing or corrupt recent block data.

sipa commented at 3:54 am on November 29, 2022: member

@luke-jr We flush the block files whenever there is a write to the block index, because the block index contains references (file numbers, byte offsets) to the blk*.dat files. So if somehow a write to the block index does make it, but a write to the block data does not, we may have a corruption. Similarly, we flush the block index before flushing the chainstate, because the chainstate contains references to the block index.

In that sense, we really don’t primarily use flushing for finality, but for ordering operations on disk. I suspect that even on systems where a flush doesn’t actually cause the write to hit disk, it’ll still order operations (writes before will make it before writes after do).

luke-jr commented at 5:49 am on November 29, 2022: member

With regard to my last (out of scope) comment, I am thinking more for performance reasons. There’s no reason we can’t detect the missing block data when we verify blocks at startup, and handle it gracefully.

in src/util/system.cpp:1189 in e663ee9b44

1185@@ -1186,8 +1186,15 @@ bool FileCommit(FILE *file)
1186     }
1187 #elif defined(MAC_OSX) && defined(F_FULLFSYNC)
1188     if (fcntl(fileno(file), F_FULLFSYNC, 0) == -1) { // Manpage says "value other than -1" is returned on success
1189-        LogPrintf("%s: fcntl F_FULLFSYNC failed: %d\n", __func__, errno);
1190-        return false;
1191+        // Certain filesystems may not support F_FULLFSYNC and fail with various error codes.

fanquake commented at 10:52 am on November 29, 2022:

Can we be more specific here? non-Apple file systems?

fanquake commented at 11:17 am on November 29, 2022: member

This bug seems to only be affecting macOS 13 (Ventura) systems, but in theory, any macOS version could be affected.

Do we know what has changed in macOS Ventura to make this start occuring? You say this could affect other versions, but given things only started breaking again with the release of Ventura (and users claim things worked perfectly with previous major versions, i.e: #26400 (comment)), clearly something has changed. It would be good to understand what that is.

I’ve had a look through the XNU sources for Ventura (https://github.com/apple-oss-distributions/xnu/tree/xnu-8792.41.9) and the previous version (Monterey): https://github.com/apple-oss-distributions/xnu/tree/xnu-8020.140.41, and the functions pointed too by https://github.com/google/leveldb/commit/296de8d5b8e4e57bd1e46c981114dfbe58a8c4fa, and didn’t see anything obvious.

gawlk commented at 10:38 pm on November 29, 2022: none

Hi y’all,

I’ve been having this issue on 2 m1 Macs with MacOS 13 but it seem to work on the latest MacOS 13.1 Beta !

On both previous Macs, it crashed after a couple minutes but on the latest on it’s been running for a while now and still going. So it might actually be a non issue.

For information the hard drive used for all 3 attempts is the same 1TB T7 Shield from Samsung.

Here’s a screenshot:

john-moffett commented at 11:56 pm on November 29, 2022: contributor

Thanks, @gawlk !

I suppose this is, indeed, a transient OS bug. I’m also running 13.1 and have tried seven different non-Apple external drives with various filesystems and have yet to trigger the issue. @fanquake , I had a look at the XNU sources, too, and likewise couldn’t find anything, though some code around an ENOTTY emission changed recently:

I can’t fathom why that would affect us in this case, though, so it may remain a mystery.

Regardless, I think it’s best to shelve this. Thanks for the comments!

john-moffett closed this on Nov 29, 2022

bitcoin locked this on Nov 29, 2023

Fix macOS failing to flush blockfiles to disk for certain external drives #26534

Context

What other projects have done

This PR

Reviews