28.0rc1 synchronizes much slower on Windows #30833

vostrnad opened this issue on September 6, 2024
  1. vostrnad commented at 10:52 am on September 6, 2024: none

    Bitcoin Core 28.0rc1 (both pre-built and self-built binary) on Windows 10 synchronizes much slower than 27.0 after around block 120,000 when blocks stop being almost empty. I’m running with default settings (dbcache=450, pruning disabled).

    Time to synchronize to block 300,000 (measured over around 10 runs each, alternating between the versions):

    • 27.0: 695 seconds
    • 28.0rc1: 2760 seconds


    Running the Linux pre-built binary in WSL doesn’t show the slowdown, so it’s probably limited to Windows.

    I’m going to bisect this, but that will definitely take me a while. If anyone suspects a commit/PR I should look at first, that might help. Update: I’ve bisected this down to #28052, for more details see #30833 (comment).

  2. sipa commented at 11:16 am on September 6, 2024: member
    Try whether #28280 had any impact.
  3. vostrnad commented at 11:25 am on September 6, 2024: none

    I actually benchmarked #28280 right before it got merged: #28280 (comment)

    From the results it definitely seems that PR is not to blame and the regression happened later.

  4. andrewtoth commented at 11:39 am on September 6, 2024: contributor
  5. hebasto added the label Windows on Sep 6, 2024
  6. fanquake added this to the milestone 28.0 on Sep 6, 2024
  7. andrewtoth commented at 1:36 pm on September 6, 2024: contributor

    The reason I suspect #30326 is that try_emplace must be doing more work than just find, and it is only called after blocks stop being empty, because previous coins are now being looked up to be spent.

    Relevant Stack Overflow question: https://stackoverflow.com/questions/24209592/performance-of-emplace-is-worse-than-check-followed-by-emplace

  8. sipa commented at 2:13 pm on September 6, 2024: member
    @andrewtoth I cannot imagine that causing a 5x slowdown, though?
  9. achow101 commented at 9:16 pm on September 6, 2024: member
    Only did one run of each but I have basically the opposite result: 4406 seconds for 27.1, and 2949 for 28.0rc1.
  10. andrewtoth commented at 10:21 pm on September 6, 2024: contributor
    @vostrnad when you say synchronizes, do you mean running with -reindex-chainstate or are you starting with a fresh data directory and downloading blocks from peers? If the latter, are you downloading blocks from a node in your local network?
  11. vostrnad commented at 10:26 pm on September 6, 2024: none
    @andrewtoth I updated the OP. I’m synchronizing from a local node with a fresh datadir each time.
  12. davidgumberg commented at 12:03 am on September 7, 2024: contributor
    @vostrnad I’m trying to get a Windows box set up to reproduce this myself, but if you get a chance, could you do a run of both again with -debug=bench set, and post the final set of [bench] logs that get written to debug.log? Those might help narrow down where the slowdown is coming from.
  13. vostrnad commented at 1:16 pm on September 7, 2024: none

    I’ve bisected this down to #28052. From the benchmark results it appears to be solely responsible for the slowdown between 27.0 and 28.0rc1 (so #30326 should be fine @andrewtoth).

    Benchmark results: (fresh sync to block 300,000 from a local node, using default settings, all binaries self-built using Mingw-w64, around 6 runs each)

    • d82283950f5ff3b2116e705f931c6e89e5fdd0be (v27.0): 738s ± 16s
    • fa7f7ac040a9467c307b20e77dc47c87d7377ded (first commit of #28052): 733s ± 21s
    • fa895c72832f9555b52d5bb1dba1093f73de3136 (top commit of #28052): 2827s ± 9s
    • 88f0419c1ab1e5f3ccf425435e530d8ceb72c7f1 (v28.0rc1): 2815s ± 13s


  14. sipa commented at 1:19 pm on September 7, 2024: member
    @vostrnad Does the performance on Windows recover if you run with the -blocksxor=0 option?
  15. vostrnad commented at 1:38 pm on September 7, 2024: none
    @sipa No, it’s exactly the same with -blocksxor=0.
  16. davidgumberg commented at 12:25 am on September 8, 2024: contributor

    @sipa No, it’s exactly the same with -blocksxor=0.

    This has uncovered a bug in #28052: InitBlocksdirXorKey does not return an empty std::vector<std::byte> when the -blocksxor=0 argument is used; it instead returns a vector of 8 zero bytes:

    static auto InitBlocksdirXorKey(const BlockManager::Options& opts)
    {
        std::array<std::byte, 8> xor_key{};

        if (opts.use_xor && fs::is_empty(opts.blocks_dir)) {
            // Only use random fresh key when the boolean option is set and on the
            // very first start of the program.
            FastRandomContext{}.fillrand(xor_key);
        }

        // [...]
        return std::vector<std::byte>{xor_key.begin(), xor_key.end()};
    }
    

    In turn, this causes us to go through the XOR logic of AutoFile::detail_fread() and AutoFile::write(), since the block autofiles’ std::vector<std::byte> m_xor members are initialized to the return value of InitBlocksdirXorKey.

    E.g.: https://github.com/bitcoin/bitcoin/blob/fa460884406b35b0dee75af23f42e8b4c4acbebc/src/streams.cpp#L10-L23

    I have a branch off of 28.x (https://github.com/davidgumberg/bitcoin/commit/aa0be4031cbe5bed407bc7a1753875cc1c4ddccb) that should fix this issue when -blocksxor=0 is set, but I haven’t tested it yet.

    I also managed to reproduce (n=1) the issue on Windows 10 LTSC Build 17763:

    bitcoind.exe -dbcache=450 -stopatheight=300000 -debug=bench -connect=localnode:8333
    

    • 28.0rc1 took 7,368 seconds
    • 27.1 took 3,594 seconds

  17. fjahr commented at 10:10 am on September 8, 2024: contributor

    This has uncovered a bug in #28052, InitBlocksdirXorKey does not return an empty std::vector<std::byte> when the -blocksxor=0 argument is used, it instead returns a vector with 8 bytes equal to zero:

    I may be missing something because I am looking at the blocksxor stuff for the first time, but that doesn’t seem to be a bug. When XOR is disabled we still write the xor file with zeros. There is even a functional test for that behavior: https://github.com/bitcoin/bitcoin/blob/master/test/functional/feature_blocksxor.py#L64 Your branch seems to change that expected behavior.

  18. tdb3 commented at 1:13 pm on September 8, 2024: contributor

    Seems worth experimenting with:

    https://github.com/bitcoin/bitcoin/blob/fa460884406b35b0dee75af23f42e8b4c4acbebc/src%2Fstreams.cpp#L76-L80

    I wonder if the array creation, the copy, or the XORing here is the long pole in the tent? Experimenting with the all-zeroes XOR key can help isolate it.

    I’ll have to take a look, but do we have any IBD tests/benches mimicking a significant IBD (e.g. the first 400k blocks)? It would be good to catch performance regressions automatically.

    Something on bitcoinperf.com? (looks like maybe the data there is stale)

    Or something like a functional test with two nodes: one node creating a chain of similar complexity/size (if we don’t want to store that part of mainnet), then the second node doing IBD from the first? CI could run the test with a timeout for an excessively long IBD (which is admittedly subjective, so not ideal). Maybe it’s better to leave CI out of this?

  19. bitcoin deleted a comment on Sep 8, 2024
  20. davidgumberg commented at 10:00 pm on September 8, 2024: contributor

    This has uncovered a bug in #28052, InitBlocksdirXorKey does not return an empty std::vector<std::byte> when the -blocksxor=0 argument is used, it instead returns a vector with 8 bytes equal to zero:

    I may be missing something because I am looking at the blocksxor stuff the first time but that doesn’t seem to be a bug. When XOR is disabled we still write the xor file with zeros. There is even a functional test for that behavior: https://github.com/bitcoin/bitcoin/blob/master/test/functional/feature_blocksxor.py#L64 Your branch seems to change that expected behavior.

    You’re right, creating an xor.dat with all zeroes when -blocksxor=0 was intended, so not a bug in that PR. Still, it doesn’t make sense to me that when a user disables XOR’ing blocks we still perform all the usual XOR logic, just with a key of zeroes; that seems like an unintended side effect of #28052.

    I tested https://github.com/davidgumberg/bitcoin/commit/aa0be4031cbe5bed407bc7a1753875cc1c4ddccb with -blocksxor=0 and the same Windows setup and the performance regression goes away.

    I’m still investigating, but my bench logs indicate that the regression comes from writing block undo data: the difference in (reported) total block connection time between v27.1 and v28.0rc1 is +1,337.85 seconds for v28, while the difference in (reported) time spent writing undo data is +1,579.31 seconds for v28:

    v28.0rc1 bench logs:

     [bench]   - Using cached block
     [bench]   - Load block from disk: 0.57ms
     [bench]     - Sanity checks: 0.02ms [5.37s (0.02ms/blk)]
     [bench]     - Fork checks: 0.40ms [159.15s (0.53ms/blk)]
     [bench]       - Connect 512 transactions: 2.95ms (0.006ms/tx, 0.002ms/txin) [303.09s (1.01ms/blk)]
     [bench]     - Verify 1180 txins: 3.45ms (0.003ms/txin) [425.49s (1.42ms/blk)]
    +[bench]     - Write undo data: 26.38ms [2296.37s (7.65ms/blk)]
     [bench]     - Index writing: 0.46ms [148.03s (0.49ms/blk)]
     [bench]   - Connect total: 31.98ms [3281.81s (10.94ms/blk)]
     [bench]   - Flush: 2.44ms [246.56s (0.82ms/blk)]
     [bench]   - Writing chainstate: 0.89ms [107.89s (0.36ms/blk)]
     UpdateTip: new best=000000000000000049a0914d83df36982c77ac1f65ade6a52bdced2ce312aba9 height=300001 version=0x00000002 log2_work=78.499665 tx=38464301 date='2014-05-10T06:51:23Z' progress=0.035833 cache=641.2MiB(5186028txo)
     [bench]   - Connect postprocess: 0.64ms [261.25s (0.87ms/blk)]
    +[bench] - Connect block: 36.53ms [4041.91s (13.47ms/blk)]
    
    v27.1 bench logs:

     [bench]   - Using cached block
     [bench]   - Load block from disk: 1.03ms
     [bench]     - Sanity checks: 0.02ms [5.58s (0.02ms/blk)]
     [bench]     - Fork checks: 0.36ms [154.95s (0.52ms/blk)]
     [bench]       - Connect 173 transactions: 1.39ms (0.008ms/tx, 0.003ms/txin) [330.82s (1.10ms/blk)]
     [bench]     - Verify 435 txins: 2.02ms (0.005ms/txin) [478.57s (1.60ms/blk)]
    +[bench]     - Write undo data: 1.24ms [717.06s (2.39ms/blk)]
     [bench]     - Index writing: 0.33ms [152.89s (0.51ms/blk)]
     [bench]   - Connect total: 5.11ms [1754.52s (5.85ms/blk)]
     [bench]   - Flush: 1.35ms [310.98s (1.04ms/blk)]
     [bench]   - Writing chainstate: 0.33ms [149.58s (0.50ms/blk)]
     UpdateTip: new best=000000000000000082aee4ff546c1db5e1aa5f9bfbaa0c76300a792b3e91fce7 height=300003 version=0x00000002 log2_work=78.499897 tx=38464563 date='2014-05-10T06:58:47Z' progress=0.035342 cache=510.6MiB(4668409txo)
     [bench]   - Connect postprocess: 0.86ms [356.88s (1.19ms/blk)]
    +[bench] - Connect block: 8.68ms [2704.06s (9.01ms/blk)]
    

    My modified branch of 28.x with -blocksxor=0 bench logs

     [bench]   - Using cached block
     [bench]   - Load block from disk: 0.23ms
     [bench]     - Sanity checks: 0.01ms [5.35s (0.02ms/blk)]
     [bench]     - Fork checks: 0.29ms [150.64s (0.50ms/blk)]
     [bench]       - Connect 512 transactions: 2.18ms (0.004ms/tx, 0.002ms/txin) [308.30s (1.03ms/blk)]
     [bench]     - Verify 1180 txins: 2.58ms (0.002ms/txin) [448.44s (1.49ms/blk)]
    +[bench]     - Write undo data: 1.62ms [758.14s (2.53ms/blk)]
     [bench]     - Index writing: 0.30ms [127.39s (0.42ms/blk)]
     [bench]   - Connect total: 5.72ms [1710.90s (5.70ms/blk)]
     [bench]   - Flush: 2.31ms [258.95s (0.86ms/blk)]
     [bench]   - Writing chainstate: 0.49ms [142.43s (0.47ms/blk)]
     UpdateTip: new best=000000000000000049a0914d83df36982c77ac1f65ade6a52bdced2ce312aba9 height=300001 version=0x00000002 log2_work=78.499665 tx=38464301 date='2014-05-10T06:51:23Z' progress=0.035817 cache=641.2MiB(5186028txo)
     [bench]   - Connect postprocess: 0.89ms [305.06s (1.02ms/blk)]
    +[bench] - Connect block: 9.64ms [2531.72s (8.44ms/blk)]
    
  21. sipa commented at 10:15 pm on September 8, 2024: member

    @davidgumberg Well, from the (apparently mistaken) assumption that the XOR logic comes with no material performance cost, it made sense to always use it (even with a key of zeroes), just from the perspective of minimizing the number of different code branch combinations.

    If the XOR logic is too slow (on some systems), as it appears to be, we have two options:

    • Disabling the XOR logic when (nonzero) XORing is not enabled; but that leaves us with the uncomfortable choice of (perhaps selectively) disabling (nonzero) XORing by default, as Windows users will otherwise still suffer with 28.0.
    • Fixing the performance issue.
  22. andrewtoth commented at 10:15 pm on September 8, 2024: contributor

    My bench logs indicate that the regression is coming from writing block undo data, the difference between the (reported) total block connection time between v27.1 and v28.0rc1 is +1,337.85 seconds for v28, and the difference in (reported) time spent writing undo block data is +1,579.31s for v28:

    This is saying that the total time for writing undo block data exceeds the total block connection time. But writing undo block data is a subsection of block connection time, so it should always be less, no?

    Also, the logs show Using cached block, meaning it is not counting saving the block itself to disk in the Connect block benchmark. So, actually saving the blocks in AcceptBlock might be taking even more time than writing the undo blocks.

  23. davidgumberg commented at 11:46 pm on September 8, 2024: contributor

    @andrewtoth

    My bench logs indicate that the regression is coming from writing block undo data, the difference between the (reported) total block connection time between v27.1 and v28.0rc1 is +1,337.85 seconds for v28, and the difference in (reported) time spent writing undo block data is +1,579.31s for v28:

    This is saying that the total time for writing undo block data exceeds total block connection time. But, writing undo block data is a subsection of block connection time, so it should always be less no?

    Sorry, I did not express that well. The two numbers I gave are the deltas between v27.1 and v28.0rc1: total time spent writing undo block data is 2,296.37s in v28.0rc1 and 717.06s in v27.1, and total time spent doing block connection is 4,041.91s in v28.0rc1 and 2,704.06s in v27.1.

    Other parts of the v28 sync are faster (at least in my n=1 runs), e.g. flushing the temporary CCoinsViewCache takes ~60 fewer seconds in v28, partially compensating for the increased time spent writing undo data.

    Also, the logs show Using cached block, meaning it is not counting saving the block itself to disk in the Connect block benchmark. So, actually saving the blocks in AcceptBlock might be taking even more time than writing the undo blocks.

    I see, I guess that explains why the difference in wall clock time (v28 took 3,774 more seconds) is more than double the slowdown reported by the ConnectBlock bench logs (v28 spent 1,337.85 more seconds on block connection).


    @tdb3

    Seems worth experimenting with:

    https://github.com/bitcoin/bitcoin/blob/fa460884406b35b0dee75af23f42e8b4c4acbebc/src%2Fstreams.cpp#L76-L80

    I wonder if the array creation, the copy, or the xoring here are the long pole in the tent? Experimenting with the all 0s xor key can help isolate.

    It seems there was some prior discussion around this in #28060 which may be of interest: #28060 (review)


    I wonder if this issue could be caused by the number of calls we make to fwrite because of the write buffering in the XOR path of AutoFile::write(); maybe platforms other than Windows are optimizing our writes into batches behind the scenes? I don’t know much about how I/O works behind the scenes, just a guess.

  24. sipa commented at 1:41 pm on September 9, 2024: member

    I wonder if this issue could be caused by the number of calls we make to fwrite because of the write buffering in the XOR path of AutoFile::write(); maybe platforms other than Windows are optimizing our writes into batches behind the scenes? I don’t know much about how I/O works behind the scenes, just a guess.

    I have not checked, but this would be my guess.

  25. maflcko commented at 2:16 pm on September 9, 2024: member

    The xor patch didn’t change buffering. Yes, in theory there is a 4096-byte max buffer; however, it is never filled in practice. Undo data consists of small integers (4 bytes or 1 byte) and short strings (like output scripts).

    I expect that if you print the number of writes before and after, they are exactly the same. If you also print the size of the writes, they should show a histogram where 1-byte writes are the most common (probably all varints), followed by various writes all shorter than 33 bytes (or whatever the largest output script is).

    I don’t have Windows, so I can’t say what is happening there and why something is observable there, while the same isn’t observable on Linux.

    It may be useful to run the Xor benchmark on the same machine in Windows and on Linux, to see if the performance differs there: ./bench_bitcoin -filter=Xor -min-time=100.

    Also, it is possible to write a new benchmark, to see if the XOR overhead is unexpectedly large inside AutoFile:

    diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
    index fc9dc5d172..96cb952c3b 100644
    --- a/src/bench/xor.cpp
    +++ b/src/bench/xor.cpp
    @@ -21,4 +21,19 @@ static void Xor(benchmark::Bench& bench)
         });
     }
     
    +static void AutoFileXor(benchmark::Bench& bench)
    +{
    +    FastRandomContext frc{/*fDeterministic=*/true};
    +    auto data{frc.randbytes<std::byte>(4'000'000)};
    +    auto key{frc.randbytes<std::byte>(8)};
    +
    +    auto path{path_to_a_temp_file_on_the_storage_device_under_test};
    +
    +    bench.batch(data.size()).unit("byte").run([&] {
    +        AutoFile f{fsbridge::fopen(path, "wb+"), key};
    +        f << data;
    +    });
    +}
    +
     BENCHMARK(Xor, benchmark::PriorityLevel::HIGH);
    +BENCHMARK(AutoFileXor, benchmark::PriorityLevel::HIGH);
    
  26. davidgumberg referenced this in commit 93da655b8b on Sep 10, 2024
  27. davidgumberg referenced this in commit 7d57d0f509 on Sep 10, 2024
  28. davidgumberg referenced this in commit 19997e483f on Sep 10, 2024
  29. davidgumberg referenced this in commit 14f51cc34d on Sep 10, 2024
  30. davidgumberg referenced this in commit 13d749703d on Sep 10, 2024
  31. maflcko commented at 9:38 am on September 10, 2024: member
    Apart from the benchmark, it may also be useful to create a flamegraph/heatmap, so that it is easy to see where the time is spent.
  32. mzumsande commented at 2:28 pm on September 10, 2024: contributor
    I managed to get bitcoind compiled on my windows installation (Windows 10 Pro 22H2, MSVC 2022), didn’t observe a notable difference in the benchmarks. I’m still in the process of trying to test the sync speed, but if it’s due to the undo data, then a reindex-chainstate should be sufficient to see the difference?! Did anyone try that already?
  33. hebasto commented at 3:03 pm on September 10, 2024: member

    I managed to get bitcoind compiled on my windows installation (Windows 10 Pro 22H2, MSVC 2022), didn’t observe a notable difference in the benchmarks.

    I hope everyone measuring performance on Windows has disabled all antivirus software, at least for the data directory. Right?

  34. vostrnad commented at 3:21 pm on September 10, 2024: none
    @hebasto Yes, I have disabled antivirus for both the data directory and the directory where the binaries are located, I checked that the antivirus CPU utilization stays low during the benchmarks, and I’m using identical directory locations for both versions so they wouldn’t be treated differently by the antivirus anyway.
  35. bitcoin deleted a comment on Sep 10, 2024
  36. achow101 commented at 2:22 pm on September 12, 2024: member
    Discussion from today’s IRC meeting is that we should move forward with @davidgumberg’s patch and default to false on Windows for this release.
  37. davidgumberg commented at 3:25 pm on September 12, 2024: contributor
    @achow101 I’m reworking the patch that fixes disabling XOR, and will open a PR later today if @sipa’s patch (#30884), which removes the ftell that happens on the XOR path, doesn’t fix this issue.
  38. achow101 closed this on Sep 17, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-29 01:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me