leveldb: Win32WritableFile without memory mapping

laanwj commented at 9:43 pm on October 30, 2015: member

Use a simple Win32WritableFile, equivalent to PosixWritableFile in posix_env.cc

The goal is to solve leveldb corruption issues on windows - this issue has been reported many times, see #5865, #6727, #5610, …

In my testing with this change I was unable to cause the database to be corrupted when crashing a windows laptop in various ways using notmyfault. Without the change it happened the first time I tried.

Use a simple Win32WritableFile, equivalent to PosixWritableFile in
posix_env.cc

256bea33ab

laanwj added the label UTXO Db and Indexes on Oct 30, 2015

sipa commented at 9:45 pm on October 30, 2015: member

Please submit to https://github.com/bitcoin/leveldb first?

laanwj commented at 9:51 pm on October 30, 2015: member

Thought about that, but I think it gets more exposure and testing here. (obviously it should still go there when accepted…)

sipa commented at 10:06 pm on October 30, 2015: member

@laanwj Sure!

jgarzik commented at 10:31 pm on October 30, 2015: contributor

ut ACK

dcousens commented at 0:24 am on October 31, 2015: contributor

ut ACK

sipa commented at 4:27 am on October 31, 2015: member

We should probably benchmark a reindex with this?

gmaxwell commented at 4:38 am on October 31, 2015: contributor

@sipa Yes, and if its much slower it’s reason to figure out whats strictly required for the syncs. And if it’s not, we could call it done. Reindex time right now is ruined by signature validation, if someone benchmarks that they must be sure to bypass.

laanwj commented at 7:40 am on October 31, 2015: member

This is the same as how it works on POSIX already, directly converted to WIN32 API. By all means, benchmark, but IMO we should apply this nevertheless (if it has no new bugs). We should first get rid of stability issues and only then worry about optimization. This will likely save many people from having to do a reindex at all.

MarcoFalke commented at 8:45 am on October 31, 2015: member

Concept ACK

leveldb: Don't do anything in Flush()

Flush() is supposed to clear application-side buffers (like fflush). As
this writable file implementation uses the OS function directly, there
are no buffers to flush.

2fc32341cc

laanwj commented at 9:12 am on October 31, 2015: member

Last commit is optional and potentially risky. From what I understand, leveldb’s WritableFiles are supposed to do their own caching, and flush() is supposed to flush this cache to the OS (its implementation is fflush(FILE*) in POSIX). As this implementation uses OS calls directly, there is no need for that. With that reasoning, Flush() can be empty.

… it’s of course also possible to use libc FILE* fwrite fflush on Windows instead of using the WIN32 API directly, then use this hack to sync, this will provide local buffering, but I was afraid that extra level of buffering may introduce syncing issues (see my post below for an implementation of this)

Then again these are all performance concerns. As said above it may be wrong to worry about that here, I’m mostly concerned with this being correct.

Diapolo commented at 9:18 am on October 31, 2015: none

Isn’t LevelDB dead at all? Their Github repo seemed to be not actively developed on the last time I checked.

laanwj commented at 9:19 am on October 31, 2015: member

@Diapolo That is a good observation. That’s why I’m doing an attempt at maintaining it. Can you help testing?

Diapolo commented at 9:25 am on October 31, 2015: none

@laanwj Yes, will first do a -reindex on testnet to be able to compare before and after… after that I’m going to kill the testnet instance and see if it’s corrupting.

laanwj commented at 9:26 am on October 31, 2015: member

Thanks!

laanwj commented at 11:02 am on October 31, 2015: member

Here is an alternative implementation of Win32WritableFile that uses application-buffered libc primitives (fopen, fwrite, …): https://github.com/laanwj/bitcoin/tree/2015_10_leveldb_win_nomap2 . Barely tested, and I’d not necessarily suggest using it instead of this, but if someone is going to benchmark it should be included (as it reduces the number of system calls in case of lots of small writes, which we don’t do, but leveldb may internally…).

laanwj commented at 9:03 am on November 1, 2015: member

@jonasschnelli it would be useful to have executables here, mind pointing your nightly builder at this?

jonasschnelli commented at 2:05 pm on November 1, 2015: contributor

Here we go: https://bitcoin.jonasschnelli.ch/pulls/6917/

NicolasDorier commented at 2:59 pm on November 1, 2015: contributor

I am so glad about it you can’t imagine the time I lost because of this bug. This is the only reason I don’t have a full node on my laptop. Trying it right now.

in src/leveldb/util/env_win.cc: in 2fc32341cc outdated

428+    : filename_(fname)
429 {
430-    return ((x + y - 1) / y) * y;
431+    std::wstring path;
432+    ToWidePath(fname, path);
433+    DWORD Flag = PathFileExistsW(path.c_str()) ? OPEN_EXISTING : CREATE_ALWAYS;

sipa commented at 8:20 pm on November 1, 2015:

Any reason not to use OPEN_ALWAYS (which should open if exists, and create if not, according to https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx)?

sipa commented at 8:21 pm on November 1, 2015:

Oh, this code is just moved.

laanwj commented at 2:33 am on November 2, 2015:

The goal is to have f=fopen(filename, "wb") semantics. It’s possible that this could be improved, but indeed I just copied the code from the previous WritableFile implementation.

in src/leveldb/util/env_win.cc: in 2fc32341cc outdated

137-    size_t _TruncateToPageBoundary(size_t s);
138-    bool _UnmapCurrentRegion();
139-    bool _MapNewRegion();
140-    DISALLOW_COPY_AND_ASSIGN(Win32MapFile);
141-    BOOL _Init(LPCWSTR Path);
142+    std::string filename_;

sipa commented at 8:53 pm on November 1, 2015:

Is this field used at all?

laanwj commented at 2:36 am on November 2, 2015:

Good catch. It’s there for troubleshooting. Looking at env_posix it’s supposed to be passed as ‘context’ argument to IOErrors, but that’s not done at the moment.

sipa commented at 9:04 pm on November 1, 2015: member

@laanwj We don’t do small writes anyway, so I doubt an application-level cache would help.

sipa commented at 10:07 pm on November 1, 2015: member

Concept ACK in any case.

laanwj commented at 2:29 am on November 2, 2015: member

@laanwj We don’t do small writes anyway, so I doubt an application-level cache would help.

We don’t, but maybe leveldb does internally? (WritableFiles are used for all of leveldb’s file creation and writing, also when building/rebuilding tables, not just to write out database batches. Not sure how careful it is to not, say, call Append multiple times to write different parts of data structures)

jonasschnelli commented at 2:10 pm on November 2, 2015: contributor

I just have started a fresh node (mainnet / this PR) on a win8.1pro with enabled mcafee antivirus (SSD / 2GB RAM / 1.4 GHZ Core i5) … will report soon.

jonasschnelli commented at 9:25 am on November 3, 2015: contributor

Have synced up to 318301,.. twice restarted bitcoin-qt (just to see if stop / start / checkblocks works). No problems with the database so far.

My node is still syncing (very slow, serval blocks per minutes). But a strange problem appeared. Qt console window says I’m no block 318301, … looking in the debug.log i can see that im 318613… Peers windows is updating… but somehow with missing data.

bildschirmfoto-2015-11-03-um-10 19 21

Peers:

Taskmanager:

laanwj commented at 11:09 am on November 3, 2015: member

@jonasschnelli that’s a known issue and unrelated to these changes, see #5664

NicolasDorier commented at 6:36 am on November 4, 2015: contributor

Pro tip for corruption problem : Attach a disk, index the blockchain data on it, unplug savagely.

I had corruption everytimes my cat played with the ethernet cable between my computer and my network disk.

I did not managed to reproduce the problem so far by just killing the process, I’ll try once I find a spare disk somewhere.

sipa commented at 6:38 am on November 4, 2015: member

I’m pretty surprised that having the blockchain database on a network disk works at all…

NicolasDorier commented at 6:41 am on November 4, 2015: contributor

Mapped Drive, then just pointing to SMB share. I’m not at my office, so I’ll try with external drive unplugged savagely instead of real network disk as soon as I can.

sipa commented at 6:43 am on November 4, 2015: member

Well it’s good to know that it seems to work even with a network filesystem. But network filesystems usually offer much weaker synchronization guarantees, so often are inappropriate for databases with strong consistency requirements.

NicolasDorier commented at 7:15 am on November 4, 2015: contributor

Yes I agree, this is why I switched to a normal VM later on Microsoft Azure. But since I’ve moved to VM they are sometimes rebooted automatically for maintainance purposes, and it got corrupted several time.

The good thing about mapped disk is that thanks to weaker synchronization guarantee, it makes easier to reproduce data corruption problems ! :)

laanwj commented at 10:04 am on November 4, 2015: member

The issue that this change tries to improve is extremely easy to reproduce already: it happens nearly every time bitcoind or bitcoin-qt running on Windows crashes, or the Windows crashes, or the power is pulled. This happens for local filesystems on drives connected through SATA.

The PR does not purport to fix issues with external harddisks, network filesystems, corrupting cables, felines, and other additional sources of complications. Being robust to those would be great, if realistically possible, but is not the immediate goal here.

jonasschnelli commented at 12:11 pm on November 4, 2015: contributor

I have powered down (unclean sudden shutdown) my Win8.1 VM and restarted Windows and Bitcoin-Qt. Bitcoin-Qt started without issues and is ~ on the same height (353900). I’ll wait now a couple of minutes and will power down (unplug) the VM host system (MacMini) and see, what happens then.

Would it be worth to do the same procedure with the current master WITHOUT this PR to compare it?

laanwj commented at 12:45 pm on November 4, 2015: member

Would it be worth to do the same procedure with the current master WITHOUT this PR to compare it?

Sure. If you have a VM environment where you can take a snapshot, that’s helpful, you can return to that after messing up the db.

jonasschnelli commented at 12:47 pm on November 4, 2015: contributor

I made a snapshot before the first sudden shutdown. But just powered down the host by unplugging the power cable… Best chain could be activated and the node continues syncing. Will do some more shutdowns and see if nothing happens,… then switch to current master and try again.

jonasschnelli commented at 1:21 pm on November 4, 2015: contributor

Now, after a “power off” (sudden shutdown), the db is corrupted (still using this PR).

“Rebuild the block databases” ends with “error opening database”:

Log:

laanwj commented at 2:13 pm on November 4, 2015: member

Re: “bad undo data”, see #6923 - it’s a potential issue, but not something that could be solved by this (as it isn’t a leveldb problem)

jonasschnelli commented at 2:43 pm on November 4, 2015: contributor

I now did restore serval times back to my snapshot and did some sudden power downs. ~50% it will end up with a corrupt database (100% with “bad undo data”).

Will now try to run my snapshot with current master without this PR.

laanwj commented at 2:54 pm on November 4, 2015: member

That’s strange - I only managed to get the undo data error once in all my tries, whereas (without this PR) it produced consistent leveldb errors.

gmaxwell commented at 3:37 pm on November 4, 2015: contributor

Difference in VM flushing behavior, I guess — it may just be that the VM buffers and reorders writes and ignores fsync.

Although… we appear to be missing a FileCommit in UndoWriteToDisk– I think we need to have synced the blocks and undo before calling the insert.

It would probably be better for performance if it wrote the block then undo, then did the syncs on both however.

Edit: oh nevermind, FlushBlockFile hits the undo file too. :-/

leveldb: report filename in errors in Win32WritableFile d3ad568777

jonasschnelli commented at 3:50 pm on November 4, 2015: contributor

With the current master (without this PR), after every sudden shut down, i get “Corruptions: error in middle of record”.

This PR looks after an improvement but getting #6923 for roughly 50% of power-off shutdowns.

laanwj added the label Windows on Nov 4, 2015

laanwj commented at 4:49 pm on November 4, 2015: member

If it improves things, that is good.

Filed this for the leveldb repository: https://github.com/bitcoin/leveldb/pull/9

laanwj commented at 4:51 pm on November 4, 2015: member

Closing the pull here, as it has been moved to the leveldb repository. It should come back to the bitcoin repository in a subtree update.

laanwj closed this on Nov 4, 2015

MarcoFalke locked this on Sep 8, 2021

leveldb: Win32WritableFile without memory mapping #6917