Block obfuscation is currently done byte-by-byte; this PR batches it into 64-bit primitives to speed up obfuscating bigger memory chunks.
This is especially relevant now that #31551 was merged, which introduced bigger obfuscatable chunks.
Since this obfuscation is optional, the speedup measured here depends on whether the key is a random value or obfuscation is completely turned off (i.e. XOR-ing with 0).
Changes in testing, benchmarking and implementation
- Added new tests comparing randomized inputs against a trivial implementation, and performing roundtrip checks with random chunks.
- Migrated `std::vector<std::byte>(8)` keys to plain `uint64_t`.
- Process unaligned bytes separately and unroll the body to 64 bytes.
Assembly
Memory alignment is enforced by a small peel loop (the std::memcpy there is optimized out on the tested platforms), together with an std::assume_aligned<8> hint; see the Godbolt listing at https://godbolt.org/z/35nveanf5 for details.
| Target & Compiler | Stride (per hot-loop iter) | Main operation(s) in loop | Effective XORs / iter |
|---|---|---|---|
| Clang x86-64 (trunk) | 64 bytes | 4 × movdqu → pxor → store | 8 × 64-bit |
| GCC x86-64 (trunk) | 64 bytes | 4 × movdqu/pxor sequence, enabled by 8-way unroll | 8 × 64-bit |
| GCC RV32 (trunk) | 8 bytes | copy 8 B to temp → 2 × 32-bit XOR → copy back | 1 × 64-bit (as 2 × 32-bit) |
| GCC s390x (big-endian 14.2) | 64 bytes | 8 × XC (mem-mem 8-B XOR) with key cached on stack | 8 × 64-bit |
Endianness
The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption.
Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order.
Since CI lacks a big-endian machine, testing was done locally via Docker.
71eb6eaa74 test: compare util::Xor with randomized inputs against simple impl
46854038e7 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t
```
Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT=71eb6eaa740ad0b28737e90e59b89a8e951d90d9)
  Time (mean ± σ):     37676.293 s ± 83.100 s    [User: 36900.535 s, System: 2220.382 s]
  Range (min … max):   37617.533 s … 37735.053 s    2 runs

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT=46854038e7984b599d25640de26d4680e62caba7)
  Time (mean ± σ):     36181.287 s ± 195.248 s    [User: 34962.822 s, System: 1988.614 s]
  Range (min … max):   36043.226 s … 36319.349 s    2 runs

Relative speed comparison
  1.04 ± 0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT=71eb6eaa740ad0b28737e90e59b89a8e951d90d9)
  1.00         COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT=46854038e7984b599d25640de26d4680e62caba7)
```
DrahtBot
commented at 7:11 am on October 24, 2024:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
Reviewers, this pull request conflicts with the following ones:
- #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
- #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
- #29307 (util: explicitly close all AutoFiles that have been written by vasild)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
l0rinc force-pushed on Oct 24, 2024
DrahtBot added the label "CI failed" on Oct 24, 2024
in src/streams.h:40 in 10b9e68768 (outdated):

```diff
-    key_offset %= key.size();
-
-    for (size_t i = 0, j = key_offset; i != write.size(); i++) {
-        write[i] ^= key[j++];
+    if (size_t remaining = write.size() - i; key.size() == 8 && remaining >= 8) { // Xor in 64-bit chunks
+        const auto key64 = *std::bit_cast<uint64_t*>(key.data());
```
Thanks, I’ll investigate.
I assumed there would be more to check; that’s why it’s still a draft.
maflcko
commented at 8:27 am on October 24, 2024:
member
I think your example may be a bit skewed? It shows how much time is spent when deserializing a CScript from a block file. However, block files contain full blocks, where many (most?) of the writes are single bytes (or 4 bytes), see #30833 (comment). Thus, it would be useful to know what the overall end-to-end performance difference is. Also taking into account the utxo db.
If you want the micro-benchmark to be representative, I’d presume you’d have to mimic the histogram of the sizes of writes. Just picking one (1024, or 1004), which is never hit in reality, and then optimizing for that may be misleading.
l0rinc force-pushed on Oct 24, 2024
l0rinc force-pushed on Oct 24, 2024
l0rinc
commented at 10:02 am on October 24, 2024:
contributor
where many (most?) of the writes are single bytes (or 4 bytes)
Thanks, I’ve extended your previous benchmarks with both Autofile serialization and very small vectors.
I will also run a reindex of 400k blocks before and after to see if the effect is measurable or not.
in src/streams.h:44 in 6ae466bf11 (outdated):

```diff
+    if (key.size() == 8 && write.size() - i >= 8) { // Xor in 64-bit chunks
+        uint64_t key64;
+        std::memcpy(&key64, key.data(), 8);
+        for (; i <= write.size() - 8; i += 8) {
+            uint64_t write64;
+            std::memcpy(&write64, write.data() + i, 8);
```
I have a hard time believing this will make a large difference, especially with the two memcpys involved.
On modern CPUs, ALU operations (especially bitwise ones) are so fast compared to any kind of memory access.
And this isn’t some advanced crypto math, it’s one Xor operation per word with a fixed key.
Could avoid the memcpys if the code takes memory alignment into account, but that makes it even more complex. Not sure the pros/cons work out here.
The speedup comes from the vectorized operations, i.e. doing a 64-bit XOR instead of byte-by-byte XOR (the memcpy seems to be eliminated successfully on 64-bit architectures), see https://godbolt.org/z/Koscjconz
Added a RISC-V compiler to https://godbolt.org/z/n5rMeYeas, where it seems to my untrained eyes that it uses two separate 32-bit XORs to emulate the 64-bit operation (but even if it were byte-by-byte on 32-bit processors, that's still the same as what it was before on 64-bit CPUs, right?).
Edit:
Memory alignment is handled via std::memcpy, optimized out on the tested platforms (see https://godbolt.org/z/dcxvh6abq):

- Clang (x86-64): 32 bytes/iter using SSE vector operations
- GCC (x86-64): 16 bytes/iter using unrolled 64-bit XORs
- RISC-V (32-bit): 8 bytes/iter using a load/XOR/store sequence
- s390x (big-endian): 64 bytes/iter with unrolled 8-byte XORs
(please validate, my assembly knowledge is mostly academic)
in src/streams.h:51 in 6ae466bf11 (outdated):

```diff
-    // way instead of doing a %, which would effectively be a division
-    // for each byte Xor'd -- much slower than need be.
-    if (j == key.size())
-        j = 0;
+    for (size_t j = 0; i < write.size(); ++i, ++j) {
+        write[i] ^= key[j % key.size()];
```
Thanks for the hint, I deliberately removed that (please check the commit messages for details), since these are optimized away.
Also, this is just the leftover part, so for key of length 8 (the version used in most places) this will have 7 iterations at most.
Can you see any difference with any of the benchmarks?
It’s only up to 7 iterations (assuming the key size is 8); sure, you’re right.
But OK, yeah, I’m a bit divided about relying on specific non-trivial things being optimized out; it makes the output very dependent on specific compiler decisions (which may be fickle in some cases).
Often the simplest code gets optimized most, since it’s more predictable.
Would you like me to extend the test or benchmark suite or try something else to make sure we’re comfortable with the change?
To get rid of the goosebumps I’m handling the remaining 4 bytes as a single 32 bit xor now, so the final loop (when keys are 8 bytes long, which is mostly the case for us, I think) does 3 iterations at most. So even if it’s not optimized away, we should be fine doing 3 divisions by a nice round number like 8.
maflcko
commented at 12:17 pm on October 24, 2024:
divisions by a nice round number like 8.
I don’t think the compiler knows the number here, so can’t use it to optimize the code based on it.
Usually these optimizations concentrate on the measurable parts based on the profiling results that I’m getting during reindexing or IBD. Obfuscating a single bit (i.e. XorSmall) wasn’t my focus, it’s already very fast, didn’t seem like the bottleneck.
Would you like me to concentrate on that scenario instead? Or would it make more sense to serialize a block and use that as the basis for the benchmarks?
C++ compiler …………………….. AppleClang 16.0.0.16000026
Would you like me to concentrate on that scenario instead? Or would it make more sense to serialize a block and use that as the basis for the benchmarks?
Well no. I think this has been mentioned previously. Generally, optimizing for micro benchmarks may not yield results that are actually meaningful or visible for end-users, because the benchmarks capture only a very specific and narrow view. Optimizing for one could even make the code slower for another (as observed above). Adding a bench for the block couldn’t hurt, but I haven’t checked how representative it is. If such a bench represents the IBD behavior, it would be ideal. (There already is a block in the hex bench data, which could be used)
Usually these optimizations concentrate on the measurable parts based on the profiling results that I’m getting during reindexing or IBD
Yes, that is more useful. It would be good to share the number you got. Because the commit message simply claims that no benefit was found (“The if (j == key.size()) optimization wasn’t kept since the benchmarks couldn’t show any advantage anymore”).
XorSmall
Looks like you can reproduce the slowdown. I wonder if it is correlated with the use of libc++ vs libstdc++
The reindex-chainstate until 600k, 2 runs just finished - comparing master against the 64/32 bit packing (current state) on Linux (with GCC, showing the above inconsistency).
```
Benchmark 1: COMMIT=dea7e2faf1bc48f96741ef84e25e6f47cefd5a92 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0
  Time (mean ± σ):     12819.367 s ± 35.155 s    [User: 11992.168 s, System: 2509.200 s]
  Range (min … max):   12794.508 s … 12844.225 s    2 runs

Benchmark 2: COMMIT=353915bae14b9704a209bc09b021d3dd2ee11cf2 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0
  Time (mean ± σ):     12685.350 s ± 19.878 s    [User: 11918.349 s, System: 2523.819 s]
  Range (min … max):   12671.295 s … 12699.406 s    2 runs
```
Reindexing is a lot more stable than IBD (as seen from multiple measurements), showing a consistent 1% speedup.
Not earth-shattering, but at least this way the obfuscation isn’t causing a speed regression anymore.
Adding a bench for the block couldn’t hurt, but I haven’t checked how representative it is
I’ve changed the usages to avoid std::vector keys; this way GCC and Clang both agree that the new results are faster (even though Clang manages to compile to 32-byte SIMD, while GCC only to 16 bytes per iteration, see #31144 (review)).
DrahtBot removed the label "CI failed" on Oct 24, 2024
l0rinc force-pushed on Oct 24, 2024
l0rinc force-pushed on Oct 24, 2024
l0rinc renamed this from "optimization: pack util::Xor into 64 bit chunks instead of doing it byte-by-byte" to "optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte" on Oct 24, 2024
l0rinc marked this as ready for review on Oct 24, 2024
laanwj added the label "Block storage" on Oct 24, 2024
maflcko
commented at 7:14 pm on October 24, 2024:
member
It would be good to explain the jpegs in the description, or even remove them. They will be excluded from the merge commit and aren’t shown, unless GitHub happens to be reachable and online. Are they saying that IBD was 4% faster? Also, I think they were created with the UB version of this pull, so may be outdated either way?
I did a quick check on my laptop and it seems the XorSmall (1+4 bytes) is slower with this pull. The Xor (modified to check 40 bytes) was twice as fast. Overall, I’d expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.
I can try to bench on another machine later, to see if it makes a difference.
Can you clarify what type of machine you tested this on?
l0rinc
commented at 8:39 pm on October 24, 2024:
contributor
Are they saying that IBD was 4% faster?
That’s what I’m measuring currently, but I don’t expect more than 2% difference here.
Also, I think they were created with the UB version of this pull, so may be outdated either way?
Benchmarks indicated that the 64 bit compiled result was basically the same.
Overall, I’d expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.
I’ll investigate, thanks.
Posting the perf here for reference:
Reindexing until 300k blocks reveals that XOR usage was reduced:
in src/test/streams_tests.cpp:374 in a3dc138798 (outdated):

```diff
@@ -270,7 +297,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file)
             BOOST_CHECK(false);
         } catch (const std::exception& e) {
             BOOST_CHECK(strstr(e.what(),
-                "Rewind limit must be less than buffer size") != nullptr);
+                        "Rewind limit must be less than buffer size") != nullptr);
```
hodlinator
commented at 12:41 pm on October 25, 2024:
Seems like the prior author just rounded off to some not-too-unreasonable tab indentation (efd2474d17098c754367b844ec646ebececc7c74). The function isn’t touched in this PR, so we should probably resist touching it here and below.
hodlinator
commented at 12:46 pm on October 25, 2024:
Experimented with changing to brace-initialization, which uncovered some slight narrowing/widening. (Thought I had an angle for making the code more robust in a more material way, but that attempt failed.)
Operating on CPU words rather than individual bytes. :+1:
Not entirely clear to me from #31144 (review) whether the optimizer is able to use SIMD. Guess picking through the binary of a GUIX-build would give a definitive answer.
The verbosity of std::memcpy hurts readability but alignment issues are real.
l0rinc marked this as a draft on Oct 25, 2024
l0rinc force-pushed on Oct 26, 2024
l0rinc force-pushed on Oct 26, 2024
l0rinc force-pushed on Oct 26, 2024
DrahtBot
commented at 9:58 pm on October 26, 2024:
contributor
Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

- Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
- A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
- An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.
DrahtBot added the label "CI failed" on Oct 26, 2024
l0rinc force-pushed on Oct 26, 2024
l0rinc renamed this from "optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte" to "optimization: change XOR obfuscation key from `std::vector<std::byte>(8)` to `uint64_t`" on Oct 26, 2024
l0rinc force-pushed on Oct 27, 2024
DrahtBot removed the label "CI failed" on Oct 27, 2024
l0rinc force-pushed on Oct 27, 2024
l0rinc force-pushed on Oct 27, 2024
l0rinc marked this as ready for review on Oct 27, 2024
l0rinc force-pushed on Oct 29, 2024
l0rinc force-pushed on Oct 29, 2024
DrahtBot added the label "CI failed" on Oct 29, 2024
DrahtBot
commented at 12:49 pm on October 29, 2024:
contributor
DrahtBot removed the label "CI failed" on Oct 29, 2024
hodlinator
commented at 12:40 pm on October 31, 2024:
contributor
hodlinator
commented at 1:09 pm on October 31, 2024:
While key_exists is only created for this `if`-block, there are other conditions involved, and we test for the negation of the value, so I find it less surprising to revert to the previous approach.
Done both, thanks, good observation about the negation
hodlinator
commented at 3:48 pm on October 31, 2024:
My point about moving the negation and changing the name made more sense in the context of keeping it inside the if-block. If you are open to moving it out, I’d say it’s better to keep the original key_exists name and original negation to avoid the churn and make it easier to review.
(Realized another reason for not having it inside the if-block is that we are mutating obfuscate_key_vector, which is used after the block).
hodlinator
commented at 1:16 pm on October 31, 2024:
I find it more useful to (static) assert that the std::array size matches the uint64 size directly. Also don’t see a point in zeroing out the local variable before returning?
I’ve added the fills to make sure we’re not using them after conversion anymore.
What would be the advantage of the static asserts?
I don’t mind removing these failsafes if you think they’re redundant or noisy.
maflcko
commented at 1:18 pm on October 31, 2024:
member
Taking a step back, I wonder if this is worth it. IIRC it gives a +1% speedup when run on spinning storage, so it seems a higher speedup is possibly visible on faster, modern storage. However, it would be good if any IO delay was taken out of IBD completely, so that the speed of storage and the speed of XOR is largely irrelevant.
I haven’t looked, but this may be possible by asking for the next block to be read into memory in the background, as soon as work on the current block begins.
hodlinator
commented at 1:20 pm on October 31, 2024:
Don’t see much point in adding the assert here in 23fc898514bf9696facbaff65251b62c362d214e, where we still only have a fixed-size std::array with the asserted fixed size of 8. Seems sufficient with the assert in the BlockManager ctor.
l0rinc
commented at 1:24 pm on October 31, 2024:
contributor
Taking a step back, I wonder if this is worth it. IIRC it gives a +1% speedup
That’s not the main point; rather, we’re storing the key in a value that can be short-circuited easily, so that when the key is 0 (i.e. the XOR is a no-op) we can skip it entirely. Previously this would only have been possible by checking each byte of the key.
It’s also a lot cleaner to store it in a primitive instead, which supports xor natively.
Xor comes up in every profiling I do, we shouldn’t have a regression because of #28207 - this PR solves that.
in src/test/streams_tests.cpp:262 in 850214ffd9 (outdated):

```diff
@@ -235,7 +262,7 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
     // Single character key
```
hodlinator
commented at 1:29 pm on October 31, 2024:
In 850214ffd9f56e887a18d0428d5881e6c1ee8652:
Single/Multi character key comments don’t make sense inside of this commit.
hodlinator
commented at 3:39 pm on October 31, 2024:
(Should be done in the initial commit which invalidates the comments IMO).
maflcko
commented at 1:39 pm on October 31, 2024:
member
we shouldn’t have a regression because of #28207 - this PR solves that.
It is not possible to do XOR without any cost at all. There will always be an overhead and I think calling #28207 a regression and this change a “fix” is not entirely accurate. This change reduces the overhead, according to the benchmarks.
That’s not the main point
The pull request title starts with “optimization”, so I got the impression that a speedup is the main point.
The reason that std::vector was picked is that switching to larger sizes is possible. However, if there was a need to do that, XOR would likely not be sufficient anyway. So limiting to 8 bytes fixed at compile time seems reasonable.
I am mostly saying that any speedup here may not be visible at all if IO is completely taken out of the critical path, but I haven’t looked into that in detail.
l0rinc
commented at 1:52 pm on October 31, 2024:
contributor
change a “fix” is not entirely accurate
When XOR is disabled we’re not XOR-ing at all now. Previously we still did the XOR, so this change restores the previous behavior when obfuscation isn’t needed.
The pull request title starts with “optimization”, so I got the impression that a speedup is the main point.
Yes, please see the updated description with the benchmarks: #31144#issue-2610689777
may not be visible at all if IO is completely taken out of the critical path
I’d argue the current implementation is slightly simpler (i.e. xor is stored and performed natively and can be disabled) and faster (2x for a representative dataset).
Since my previous Concept ACK, the PR was changed to switch the xor key more completely to uint64_t. Before the PR, we were already using fixed-size of 8 bytes for the obfuscation value in the file formats, so changing the type to uint64_t shouldn’t be noticeable to users. :+1:
Even if we could move reading and XOR-ing out of the hot path as suggested by maflcko, we might as well make use the CPU architectures we have. I would expect larger-sized XOR operations to have less overhead and energy waste (less heat).
maflcko
commented at 5:32 pm on October 31, 2024:
member
We could have used ReadLE64 to unify the byte order for keys and writable values, but that shouldn’t be necessary, since both have the same endianness locally, which shouldn’t be affected by a byte-by-byte XOR.
The s390x unit tests fail:
```
./src/test/streams_tests.cpp(40): error: in "streams_tests/xor_bytes": check { expected.begin(), expected.end() } == { actual.begin(), actual.end() } has failed.
Mismatch at position 0: � != |
Mismatch at position 1: Y != �
Mismatch at position 2: � != �
Mismatch at position 3: � != �
Mismatch at position 4: w != �
Mismatch at position 5: C != �
Mismatch at position 6:  != x
Mismatch at position 7: � != �
Mismatch at position 8: � != C
Mismatch at position 9: Y != �
Mismatch at position 10: � != �
Mismatch at position 11: , != �
Mismatch at position 12: � != R
Mismatch at position 13: 8 != �
Mismatch at position 14: � != �
Mismatch at position 15: � != �
Mismatch at position 16: � != l
Mismatch at position 17: � != n
Mismatch at position 18: � != �
Mismatch at position 19: t != B
Mismatch at position 20: ; != �
Mismatch at position 21:  != �
Mismatch at position 22: � != �
Mismatch at position 23: � != �
Mismatch at position 24: k != �
Mismatch at position 25: � != Z
Mismatch at position 26: � != �
Mismatch at position 27: � != �
Mismatch at position 28: � != #
Mismatch at position 29: 8 != �
Mismatch at position 30: � != �
Mismatch at position 31: � != �
Mismatch at position 32: g != �
Mismatch at position 33: � != ^
Mismatch at position 34: � != �
Mismatch at position 36: � != k
Mismatch at position 37: * != �
Mismatch at position 38: q != 
Mismatch at position 39: � != �
Mismatch at position 40: � != �
Mismatch at position 41: r != e
Mismatch at position 42: � != �
```
l0rinc
commented at 5:42 pm on October 31, 2024:
contributor
The s390x unit tests fail:
I don’t know how to access that; is it part of CI? Does the test suite pass on it otherwise, or was it just curiosity? Do you know why it fails? Is my assumption incorrect that endianness applies to both parts (key and value) the same way? Is the test wrong, or the XOR? Should I change the test so that XOR-ing twice reveals the original data (while the intermediate state does not)?
maflcko
commented at 6:37 pm on October 31, 2024:
member
I don’t know how to access that, is it part of CI?
It needs to be run manually. See https://github.com/bitcoin/bitcoin/tree/master/ci#running-a-stage-locally. (`podman run --rm --privileged docker.io/multiarch/qemu-user-static --reset -p yes` may be required to set up qemu-s390x, depending on your setup.) Then something like `MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh` should run it.
Does the test suite pass on it otherwise or was it just curiosity?
Yes, it should pass on s390x. If not, that is a bug somewhere.
l0rinc marked this as a draft on Oct 31, 2024
l0rinc force-pushed on Nov 2, 2024
l0rinc
commented at 7:45 pm on November 2, 2024:
contributor
Thanks for the hints @maflcko, I was under the impression that big-endian tests were run automatically.
Fix
It seems that std::rotr doesn’t take endianness into account, so the fix makes the key rotation byte-order aware.
- The change also includes an updated benchmarking suite, with 700k blocks inspected for all usages of util::Xor to make the benchmark representative.
- Changed the AutoFileXor benchmark to measure the regression of turning off obfuscation.
- I’ve also updated all key derivations to vector-to-uint64_t instead of generating the 64-bit key directly (it’s more consistent, more representative, and helps with endianness).
- Added a XOR roundtrip test which applies the XOR in random chunks, asserts that the result differs from the original, then reapplies it in different random chunks and asserts that it matches the original.
l0rinc marked this as ready for review on Nov 2, 2024
l0rinc force-pushed on Nov 5, 2024
l0rinc
commented at 4:40 pm on November 5, 2024:
contributor
Updated the benchmark to 860k blocks (September 2024):
This one contains a lot of very big arrays (96'233 separate sizes, biggest was 3'992'470 bytes long) - a big departure from the previous 400k and 700k blocks (having 1500 sizes, biggest was 9319 bytes long).
The performance characteristics are also quite different, now that we have more and bigger byte arrays:
C++ compiler …………………….. AppleClang 16.0.0.16000026
Before:

| ns/byte | byte/s | err% | total | benchmark |
|--------:|-------------------:|-----:|-------:|:-------------|
| 1.29 | 774,577,944.12 | 0.2% | 115.99 | XorHistogram |

After:

| ns/byte | byte/s | err% | total | benchmark |
|--------:|-------------------:|-----:|-------:|:-------------|
| 0.04 | 26,411,646,837.32 | 0.2% | 8.97 | XorHistogram |
i.e. ~35x faster with Clang at processing the data with representative histograms.
C++ compiler …………………….. GNU 13.2.0
Before:

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
|--------:|------------------:|-----:|---------:|---------:|------:|---------:|------:|------:|:-------------|
| 0.97 | 1,032,916,679.87 | 0.0% | 9.01 | 3.29 | 2.738 | 1.00 | 0.0% | 86.58 | XorHistogram |

After:

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
|--------:|------------------:|-----:|---------:|---------:|------:|---------:|------:|------:|:-------------|
| 0.10 | 10,369,097,976.62 | 0.0% | 0.32 | 0.33 | 0.985 | 0.06 | 0.6% | 8.63 | XorHistogram |
i.e. ~10x faster with GCC at processing the data with representative histograms.
Edit: note that I couldn’t use random byte generation for each benchmark value since it timed out on CI. I have replaced it with getting subsets of a single big random vector.
l0rinc force-pushed on Nov 5, 2024
DrahtBot added the label "CI failed" on Nov 5, 2024
DrahtBot
commented at 4:49 pm on November 5, 2024:
contributor
l0rinc force-pushed on Nov 6, 2024
l0rinc force-pushed on Nov 6, 2024
l0rinc force-pushed on Nov 7, 2024
l0rinc force-pushed on Nov 8, 2024
DrahtBot removed the label "CI failed" on Nov 8, 2024
l0rinc
commented at 10:59 am on November 11, 2024:
contributor
I ran a reindex-chainstate until 860k blocks (SSD, i7 CPU), before and after this change, 2 runs per commit.
As stated in the previous comment, the latter blocks (>700k) seem to contain a lot of very big vectors where the new algorithm shines.
3efc72ff7cbdfb788b23bf4346e29ba99362c120 - end-to-end vector to uint64
08db1794647c37f966c525411f931a4a0f6b6119 - dummy commit that forces obfuscation keys to be 0 to ignore xor operations (since -blocksxor=0 doesn’t affect dbwrapper), to see how xor affects speed in general
It doesn’t (at least not for all test cases); I’m deliberately generating vectors and converting them, instead of generating 64-bit values directly, since that’s what happens in production code (this should address several of your comments).
hodlinator
commented at 10:45 pm on November 11, 2024:
“Specify the optimization categories” means that the compiler will be able to optimize the cases where one of the parameters (the size) is a constant, separately from each other. The default statement would work, but would be very slow, since the 1-, 2- and 4-byte versions wouldn’t be specialized.
hodlinator
commented at 10:42 pm on November 11, 2024:
How about
```cpp
// Help optimizers along by sending constant parameter values into the inlined function,
// resulting in more efficient substitutions of memcpy() -> native pow-2 copy instructions.
switch (write.size()) {
case 0: break;
case 1: XorInt(write, key, 1); break;
case 2: XorInt(write, key, 2); break;
case 4: XorInt(write, key, 4); break;
default: XorInt(write, key, write.size());
}
```
l0rinc
commented at 11:07 pm on November 11, 2024:
I ended up with only `// Help the compiler specialize 1, 2 and 4 byte cases`, since the rest was just speculation.
The values in the histogram (i.e. total bytes streamed through xor) add up to 92gb, but 1 byte values occupy half of that. (edit: this is only true for the first row, in other cases we would need to multiply by the first column)
Since we have thousands of big values that represent vector of that size, we have to include all of those into the test set at least once.
I had to scale down the histogram such that the lower values, having a few hundred occurrences aren’t flattened out completely - to make the histogram still representative. Do you have a better idea?
XorHistogram claims to use 8 GB of RAM. Could be a bit much if we want to be able to also run benchmarks on low-end devices.
Yes, but if I scale it down more, more values will be equal in the histogram and it won’t reflect real usage.
That’s why I’ve set it to low priority, we don’t have to run these for every execution.
Edit: pushed some nits to git range-diff 866f4fa521f6932162570d6531055cc007e3d0cd..a1232973189126cfc9526713011461709685fcc8 866f4fa521f6932162570d6531055cc007e3d0cd..57caa965b5ae284e501f892415d60fcb536f4c0e
l0rinc force-pushed on Nov 11, 2024
l0rinc renamed this from "optimization: change XOR obfuscation key from `std::vector<std::byte>(8)` to `uint64_t`" to "optimization: change XOR obfuscation key from `std::vector<std::byte>{8}` to `uint64_t`" on Nov 11, 2024
hodlinator
commented at 11:43 pm on November 11, 2024:
contributor
Care to explain the scaling_factor value?
The values in the histogram (i.e. total bytes streamed through xor) add up to ~92gb, but 1 byte values occupy half of that. I had to scale down the histogram such that the lower values, having a few hundred occurrences aren’t flattened out completely - to make the histogram still representative. Do you have a better idea?
So something like 1.28TB instead of 92GB, but I can’t seem to get my head screwed on right today.
If I re-understood correctly what the code was doing with the scaling - it’s doing only 1'000'000 XOR-passes for the most common size (1 byte) instead of 47'584'838'861, and scaling down the number of XOR-passes for the others by the same factor.
fanquake
commented at 10:07 am on November 12, 2024:
member
I haven’t really looked at the changes here, but just looking at the diff (+96'000 lines), my feedback would be that you’ll need to change approach in regards to making your data available (if the intent is to have that included), as I doubt we’ll be adding 96'000 lines to bench/xor.cpp. You could look at how we generate a header from bench/data/block413567.raw for a different approach, as including a small binary blob, and parsing it into a header at compile time is far more palatable.
l0rinc
commented at 3:16 pm on November 12, 2024:
contributor
Thanks @fanquake, I thought of that, can you please help me understand the constraints?
Wouldn’t that require a cmake generation step from binary to header which would basically produce the exact same lines as what we have now?
Would it help if I simply extracted it to a separate header file instead?
So something like 1.28TB instead of 92GB, but I can’t seem to get my head screwed on right today.
Yeah, I’ve edited that part since; my napkin calculations were only true for the first row, in the other cases we would need to multiply by the first column, like you did. But the point is that it’s a lot of data that we have to scale down.
but seems slightly slower in my measurements
I thought of that as well, but wanted to avoid floating point conversion (likely the reason for the slowness in your example)
fanquake
commented at 5:51 pm on November 12, 2024:
member
Wouldn’t that require a cmake generation step from binary to header which would basically produce the exact same lines as what we have now?
Yes. See bench/data/block413567.raw & bench/data/block413567.raw.h, where at build time a header file of ~125'000 lines is produced.
Would it help if I simply extracted it to a separate header file instead?
I don’t think so. The point is more to not add 100'000s of lines of “data” to this repo, which doesn’t scale across many benchmarks, creates unusable diffs, leaves (source) files unviewable on GH etc.
l0rinc force-pushed
on Nov 13, 2024
l0rinc
commented at 3:27 pm on November 13, 2024:
contributor
I understand you were burned by endianness but I disagree that it’s worth sacrificing readability where endianness is a non-issue.
Thanks @hodlinator for the suggestions, I tried them all, but in the end decided that I value consistency more than coming up with a separate solution for each test case. These are ugly, I agree, but at least they’re testing the setup we’re using in prod.
I did however change the hard-coded 8 values to sizeof xor_key (for memcpy) or sizeof(uint64_t) for vector inits.
The point is more to not add 100'000s of lines of “data” to this repo
In the end I stored the sorted diffs (since the lines are correlated, i.e. more likely to have similar neighbours) and compressed them using .tar.gz (added the generator python script as a gist, please verify) - this way the histogram data is ~100 kb instead of 1.7 MB (thanks for the hint @fanquake).
I’ve extended GenerateHeaderFromRaw.cmake with compression support (adjusting GenerateHeaders.cmake to trim the suffix from the header name) and added more safety asserts to make sure the data read back is the same as before.
33+}
34+
35+static void XorHistogram(benchmark::Bench& bench)
36+{
37+ // The histogram represents util::Xor method's write.size() histograms for the first 860k blocks
38+ // aggregated and encoded with https://gist.github.com/l0rinc/a44da845ad32ec89c30525507cdd28ee
hodlinator
commented at 3:28 pm on December 3, 2024:
nits:
Although there is precedent for adding references to gists, I’m not sure we should encourage it. Would have preferred a file in this repo’s contrib/ directory.
Would also have preferred that it did the full calculation of the .tgz file instead of having a hard-coded array, computing it by looking at linearized blocks on disk.
git range-diff master 57caa96 f2fd1f7
git show e314bb7e00 > old
git show 91a8fde051 > new
meld old new
Thanks for using more constexpr std::arrays and clearer sizeofs!
Nice that block data could be compressed to such a large extent.
nit: Would prefer the *.cmake changes were broken out into their own commit, keeping only src/bench/CMakeLists.txt as part of the benchmark change.
ryanofsky
commented at 5:00 pm on December 3, 2024:
contributor
Concept ACK, but curious for more feedback from @maflcko about this PR. The actual code changes here do not seem too complicated but maybe they make the code less generic. I wonder if you think there are concrete downsides to this PR, or if the changes are ok but possibly not worth the review effort (as #31144 (comment) seems to suggest)
I’m happy to spend time reviewing this if it improves performance and doesn’t cause other problems.
this way the histogram data is ~100 kb instead of 1.7 MB
Current approach seems ok to me, but wondering if it might be better to just use a sampling of the most common write sizes instead of including the entire histogram. It seems like if you take the top 50 sizes it covers 99.6% of the writes, and might make the test more maintainable and the PR easier to understand without changing results too much.
maflcko
commented at 5:09 pm on December 3, 2024:
member
Concept ACK, but curious for more feedback from @maflcko about this PR. The actual code changes here do not seem too complicated but maybe they make the code less generic.
There is a good chance that increasing the size of the vector is insufficient, if there is ever a need to increase it to more than 8 bytes, so a complete rewrite may be needed in that case anyway. However, this is just my guess and only time will tell. So I’d say this change is probably fine for now.
Would still be nice if there were a way to take all of it out of the hot path (possibly with higher overall benefits), but I don’t know if such a change is possible and would replace this pull.
l0rinc
commented at 5:44 pm on December 3, 2024:
contributor
Would still be nice if there were a way to take all of it out of the hot path
but wondering it it might be better to just use a sampling of the most common write sizes
@fanquake mentioned that he thinks this benchmark could be useful - if he’s fine with the truncated version as well, I’ll simplify (would solve some of @hodlinator’s cmake concerns as well).
maflcko
commented at 6:33 pm on December 3, 2024:
member
It seems like if you take the top 50 sizes it covers 99.6% of the writes
I had the same thought. Obviously there could be an unlikely problem if the remaining 0.4% of writes accounted for the majority of the time, but that seems unlikely. Other than that, taking only the top N seems preferable.
l0rinc
commented at 9:38 am on December 5, 2024:
contributor
Would still be nice if there were a way to take all of it out of the hot path
Since blocks are XOR-ed as well, I can’t meaningfully test it with a reindex(-chainstate), so I did 2 full IBDs until 800k blocks, rebased after #30039, with -blocksxor=0 to test whether we can disable XOR completely now.
Benchmark 1: COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0
  Time (mean ± σ): 25797.921 s ± 61.629 s [User: 26803.189 s, System: 1457.936 s]
  Range (min … max): 25754.343 s … 25841.500 s    2 runs

Benchmark 2: COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0
  Time (mean ± σ): 23751.046 s ± 342.376 s [User: 25322.345 s, System: 1509.236 s]
  Range (min … max): 23508.949 s … 23993.142 s    2 runs
Which indicates a 9% speedup compared to the baseline:
Summary
  COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0 ran
    1.09 ± 0.02 times faster than COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0
maflcko
commented at 9:46 am on December 5, 2024:
member
Would still be nice to if there was a way to take all of it out of the hot path (possibly with higher overall benefits), but I don’t know if such a change is possible and will replace this pull.
Actually, this change here also affects RPC performance, not just internal validation, so this can be done in a follow-up or separate pull, if it is possible at all.
l0rinc
commented at 1:48 pm on December 6, 2024:
contributor
More context:
The previous benchmark was for completely turning off XOR - but we can only do that for a new IBD, by explicitly setting it to 0. For the majority of cases we likely want to still do the XOR, so this PR is meant to speed it up.
I have remeasured it by doing a full IBD until 800k blocks (two runs to measure stability, since reindex wouldn’t cover all usages of XOR):
Benchmark 1: COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0
  Time (mean ± σ): 25601.461 s ± 65.686 s [User: 27025.116 s, System: 1586.908 s]
  Range (min … max): 25555.014 s … 25647.907 s    2 runs

Benchmark 2: COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0
  Time (mean ± σ): 24526.781 s ± 389.029 s [User: 25525.801 s, System: 1552.625 s]
  Range (min … max): 24251.697 s … 24801.866 s    2 runs
Which indicates that this will speed up IBD by roughly 4% on average (now that 30039 was merged the difference is more obvious):
Summary
  COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0 ran
    1.04 ± 0.02 times faster than COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0
l0rinc renamed this:
optimization: change XOR obfuscation key from `std::vector<std::byte>{8}` to `uint64_t`
optimization: speed up XOR by 4% (9% when disabled) by applying it in larger batches
on Dec 6, 2024
l0rinc force-pushed
on Dec 6, 2024
l0rinc
commented at 5:30 pm on December 6, 2024:
contributor
Thanks for the reviews and hints, I’ve pushed the following changes:
Reverted all the cmake changes and histogram archives @hodlinator mentioned, and based on the hints of @ryanofsky and @maflcko I’ve kept only the top entries (by frequency, re-sorted by size), making sure that the really big write-vectors are also covered (so I kept the first 1000 instead of just the first 50). This enabled putting all the data in the source file.
Added Assumes to each xor to check that we don’t have any useless calls with 0 keys - making sure we “turn off” the feature when we can.
yes, it’s an optimization to avoid doing any rotation when it would wrap around - it would work without this as well.
It’s not a measurable speedup, though, so I can remove it if you insist.
in src/node/blockstorage.cpp:1178 in b9c847fd09 (outdated)
No, I mean just a simple struct. Have you seen my suggestion?
The benefits would be that passing the integral value around would be type-safe. Also, the endian-considerations are fully contained in a simple struct, as opposed to the xor internal implementation detail and all modules that use xor. Finally, the memcpy would also be contained in a single place. Overall this could make the code smaller, or not, depending on how many users there are. However, in any case, encapsulating the assumptions around type-safety and endianness would already be worth it in my view.
ryanofsky
commented at 3:49 pm on December 10, 2024:
In commit “test: Compare util::Xor with randomized inputs against simple impl” (e1074081c9f1895a4f629dfee347ceae484a10d3)
Would be good to add a comment explaining the test. Test seems to be encoding and then decoding random byte vectors with random 8-byte xor keys, using differently-sized and differently-aligned random chunks for encoding and decoding, and then making sure the byte vectors are unchanged after the round trip.
ryanofsky
commented at 3:56 pm on December 10, 2024:
In commit “test: Compare util::Xor with randomized inputs against simple impl” (e1074081c9f1895a4f629dfee347ceae484a10d3)
Again adding a test comment could be helpful here. This test is making sure the util::Xor function returns same results as a naive byte-by-byte xor with an 8-byte key, using 100 random sized random byte vectors.
Would suggest moving xor_bytes_reference test up before the xor_roundtrip_random_chunks since it seems like a simpler test that’s an easier introduction to this code and could be followed by more complicated tests.
16+static void XorHistogram(benchmark::Bench& bench)
17 {
18- FastRandomContext frc{/*fDeterministic=*/true};
19- auto data{frc.randbytes<std::byte>(1024)};
20- auto key{frc.randbytes<std::byte>(31)};
21+ // The top util::Xor method's [write.size(), frequency] calls for the IBD of the first 860k blocks
ryanofsky
commented at 5:07 pm on December 10, 2024:
I think instead of taking the top 1000 calls by count, it would make sense to take the top X calls by (size*count) as suggested #31144 (comment), where X could be smaller than 1000, because (size*count) should more closely approximate the time spent on all writes of a given size than count ignoring size. This should make the test more realistic and also allow shrinking the histogram.
This isn’t needed anymore, since in #31551 we batch all the tiny calls now, so this PR only deals with doing the obfuscation on 64 bits
in src/test/streams_tests.cpp:61 in e1074081c9 (outdated)
56+ for (size_t test{0}; test < 100; ++test) {
57+ const size_t write_size{1 + rng.randrange(100U)};
58+ const size_t key_offset{rng.randrange(3 * 8U)}; // Should wrap around
59+
60+ std::vector key_bytes{rng.randbytes<std::byte>(sizeof(uint64_t))};
61+ uint64_t key;
ryanofsky
commented at 11:00 pm on December 10, 2024:
In commit “test: Compare util::Xor with randomized inputs against simple impl” (e1074081c9f1895a4f629dfee347ceae484a10d3)
Would be good to use consistent variable names in these tests. The other tests are calling key vectors xor_pat and calling key values xor_key, while these tests are calling key vectors key_bytes and calling key values key. Would be clearer to use consistent names.
Also, after this PR each test is keeping two different variables and two different representations for each key. It would be good to clean this up afterwards and just have one variable per key. Using a dedicated type for keys like the XorKey struct Marco suggested #31144 (review) would be even more ideal.
ryanofsky
commented at 11:23 pm on December 10, 2024:
In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)
Would be a helpful to have a comment saying what this benchmark is measuring. Maybe: // Measure speed of util::Xor function applied to a set of byte vectors. The byte vectors are filled with random data and have sizes matching a distribution of data write sizes observed during IBD.
1040+ test_data.emplace_back(rand_bytes);
1041+ }
1042+ }
1043+ assert(total_bytes == 114'929'502);
1044+
1045+ std::ranges::shuffle(test_data, rng); // Make it more realistic & less predictable
ryanofsky
commented at 11:48 pm on December 10, 2024:
In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)
It seems awkward to have code with all these hardcoded values that will make it hard to update the histogram, and to end up with test data we can’t directly control the size of. Would be cleaner to just choose how much data to generate, and generate it without hardcoding values from the histogram. Would suggest something more like the following, with an easily adjustable size that doesn’t hardcode other values.
ryanofsky
commented at 0:03 am on December 11, 2024:
In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)
Not really sure I understand the goal of this benchmark. Is it significant that the xor key is 0? Would be helpful to have a description of what this benchmark is measuring and indicating.
ryanofsky
commented at 0:28 am on December 11, 2024:
In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)
Seems like it would be less fragile and would simplify callers to replace this Assume(key) with if (!key) return;
l0rinc
commented at 12:19 pm on December 11, 2024:
I could do that, but the Assume here was meant to make sure no calls get here when the key is 0 in the first place - since that can usually eliminate other work as well (e.g. MakeWritableByteSpan in DataStream#Xor)
ryanofsky
commented at 0:36 am on December 11, 2024:
In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)
I’m confused why this code is changing in this commit when it seems unrelated to the stream.h optimization; the behavior seems the same as before. If this is a refactoring cleanup it would be good to move it to a separate commit explaining that it is a refactoring and what the point of these changes may be.
We can’t write the obfuscation vector directly anymore (since we’re storing an Obfuscation object now) and Read can only read into a vector, so this part needed a temp vector - which looks a bit awkward, indeed.
Split it out into a new commit, thanks for the hint!
in src/node/mempool_persist.cpp:62 in 7a2e5ec977 (outdated)
ryanofsky
commented at 0:42 am on December 11, 2024:
In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)
Are all the changes in this file also a refactoring that doesn’t change behavior? I don’t understand why these changes are in a commit that is supposed to be optimizing stream.h behavior. Would suggest splitting this commit up and explaining what the purpose of these changes is. Maybe they would make more sense in the next commit so this code is not changing twice?
Now that we have a dedicated Obfuscation type these are trivial; let me know if you still think they should be split out into dedicated commits.
ryanofsky
commented at 1:15 am on December 11, 2024:
contributor
Code review b9c847fd093d100628817af98fe837db938160f7. These changes look good and make sense, and I reviewed almost everything but have a few pieces of feedback:
I would very strongly endorse Marco’s suggestion to represent keys with an XorKey struct instead of raw uint64_t values so the code getting and setting keys is simpler, safer, and more uniform, and we can avoid a proliferation of memcpy calls.
I don’t think I understand structure of the third and fourth commits. The third commit seems to be adding an optimization to streams.h but also refactoring code not directly related to streams.h and then the fourth commit is refactoring a lot of the same code that was just refactored, and some of the code that was optimized as well. Would suggest doing this more cleanly in 3 commits:
First commit adding optimized stream.h/stream.cpp functions and wrappers to provide backwards compatibility so no other code or tests have to change in the commit.
Second commit updating code and tests to call the optimized stream.h API instead of the backwards compatibility wrappers.
Third commit deleting stream.h backwards compatibility wrappers.
l0rinc marked this as a draft
on Dec 14, 2024
l0rinc force-pushed
on Dec 21, 2024
DrahtBot added the label
CI failed
on Dec 21, 2024
DrahtBot
commented at 5:07 pm on December 21, 2024:
contributor
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
l0rinc force-pushed
on Dec 21, 2024
l0rinc renamed this:
optimization: speed up XOR by 4% (9% when disabled) by applying it in larger batches
optimization: batch XOR operations 12% faster IBD
on Dec 22, 2024
l0rinc
commented at 11:42 am on December 22, 2024:
contributor
The PR has been split into 3 to simplify review, please check those out first:
Time excludes the 8 minutes to flush 26 GiB worth of chainstate to disk during shutdown. That itself is twice as fast as a few months ago: #30987 (comment)
After: 4 hours and 50 minutes
hodlinator
commented at 10:33 am on January 8, 2025:
contributor
3% speedup is less to write home about but still good. Having a less constrained -dbcache setting (30 GB, as Sjors did) would lead to fewer XOR reads/writes.
The 12.3% speedup (l0rinc) could at least in part be explained by the 1 GB setting, leading to more XOR operations.
An optimization that provides a bigger win for constrained devices should be welcomed.
l0rinc
commented at 9:30 am on January 10, 2025:
contributor
Thank you @Sjors for testing it.
I was surprised to see your config only revealed a 3% change so I reran the full IBDs with the configs you had: -dbcache=30000 -stopatheight=878000 (I had -dbcache=1000 -stopatheight=870000 before).
I suspect the difference in our measurements could stem from doing a single run and not including the final dump in the measurements.
I was using a HDD this time and wasn’t seeding from local nodes, so the variance was a bit bigger for me, but I ran both before and after several times and there’s an obvious clustering (the before case was consistently slower than any after run, showing a ~12% speedup on average even with high dbcache):
hyperfine \
  --runs 2 \
  --parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,fe7365584bb3703e5691c93fb004772e84db3697 \
  --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
  'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -printtoconsole=0'

Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -printtoconsole=0
  Time (mean ± σ): 40251.909 s ± 1669.663 s [User: 51304.669 s, System: 1889.767 s]
  Range (min … max): 39071.279 s … 41432.539 s    2 runs

Benchmark 2: COMMIT=fe7365584bb3703e5691c93fb004772e84db3697 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -printtoconsole=0
  Time (mean ± σ): 36062.225 s ± 916.289 s [User: 47770.885 s, System: 2097.642 s]
  Range (min … max): 35414.310 s … 36710.139 s    2 runs

Summary
  COMMIT=fe7365584bb3703e5691c93fb004772e84db3697 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -printtoconsole=0 ran
    1.12 ± 0.05 times faster than COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -printtoconsole=0
maflcko
commented at 10:05 am on January 10, 2025:
member
Not sure how useful it is to derive speed improvements from measurements where the variance is about half as large as the difference itself. Not claiming this is the case here, but if you measure from the public network, you could very well just measure the bandwidth of the picked nodes (completely unrelated to this pull).
It is fine if you want to do those measurements locally for fun, but putting them in the pull request title and description doesn’t seem ideal. It would be better to focus on stable and reproducible measurements there.
l0rinc
commented at 10:09 am on January 10, 2025:
contributor
Usually the variance is a lot lower (see previous measurements), but these are just my benchmarks (I want them to be as close to reality as possible, that’s why I’m repeating them to have some predictability), I would appreciate if you could provide independent measurements that you find more stable.
hodlinator
commented at 10:43 am on January 11, 2025:
contributor
bitcoind -dbcache=30000 -stopatheight=878000 -blocksdir=/magnetic/.bitcoin -addnode=local-network
@Sjors, what kind of drive is /magnetic/? (Edit: I’m thinking if the drive is a bit clunky, it will crowd out the speedup from optimizing XOR).
l0rinc
commented at 1:07 pm on January 11, 2025:
contributor
I think I managed to reproduce the ~2% difference - by not doing an IBD but a -reindex-chainstate.
@Sjors, was your datadir completely empty for the runs? My 12% comes from having nothing locally (e.g. no blocks) to being fully synced (i.e. has to include the final flush as well) - to be as close to the user’s experience as possible.
hyperfine \
  --runs 2 \
  --parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,fe7365584bb3703e5691c93fb004772e84db3697 \
  --prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
  'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -reindex-chainstate -printtoconsole=0 -connect=0'

Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -reindex-chainstate -printtoconsole=0 -connect=0
  Time (mean ± σ): 23664.320 s ± 111.385 s [User: 35795.225 s, System: 714.912 s]
  Range (min … max): 23585.559 s … 23743.081 s    2 runs

Benchmark 2: COMMIT=fe7365584bb3703e5691c93fb004772e84db3697 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -reindex-chainstate -printtoconsole=0 -connect=0
  Time (mean ± σ): 23277.741 s ± 172.333 s [User: 34509.524 s, System: 582.073 s]
  Range (min … max): 23155.883 s … 23399.599 s    2 runs

Summary
  COMMIT=fe7365584bb3703e5691c93fb004772e84db3697 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -reindex-chainstate -printtoconsole=0 -connect=0 ran
    1.02 ± 0.01 times faster than COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -dbcache=30000 -stopatheight=878000 -reindex-chainstate -printtoconsole=0 -connect=0
andrewtoth
commented at 8:47 pm on January 11, 2025:
contributor
I benchmarked this PR rebased onto 35bf426e02210c1bbb04926f4ca2e0285fbfcd11 up to block 878k two times each, and I saw a 9% speedup. This was using dbcache=30000.
Time (mean ± σ): 18704.258 s ± 4.425 s [User: 37215.600 s, System: 891.293 s]
Range (min … max): 18701.130 s … 18707.387 s    2 runs

Summary
  'echo 1298bae74a1f690fd6cc0e029e490537cbeb301b && /usr/bin/time ./build/src/bitcoind -printtoconsole=0 -dbcache=30000 -connect=192.168.2.171 -stopatheight=878000' ran
    1.09 ± 0.00 times faster than 'echo 35bf426e02210c1bbb04926f4ca2e0285fbfcd11 && /usr/bin/time ./build/src/bitcoind -printtoconsole=0 -dbcache=30000 -connect=192.168.2.171 -stopatheight=878000'
Sjors
commented at 9:05 am on January 13, 2025:
member
@hodlinator a Western Digital spinning disk. Much slower than SSD but fine for just writing out blocks to disk.
(I’m assuming we’re not doing something foolish like writing the block, then reading it again to do the xor and writing it back)
@l0rinc I wiped the blocks, indexes and chainstate dirs between runs. Both times I mention exclude the 8 minute flush, which remained similar.
I agree with @maflcko that a difference of 10 minutes is probably not statistically significant. I don’t have a good offline benchmark setup.
l0rinc
commented at 9:44 am on January 13, 2025:
contributor
@Sjors, your setup is already extremely fast - it seems this optimization shines mostly on commodity hardware, which I assume is used more often.
Let’s wait for other reproducers.
Time (mean ± σ): 64403.836 s ± 9298.080 s
User: 57218.987 s, System: 3918.333 s
Range (min … max): 50760.210 s … 70526.366 s
Runs: 4
Commit: 5acf12bafe
Time (mean ± σ): 71941.764 s ± 2646.491 s
User: 60882.538 s, System: 3941.555 s
Range (min … max): 68480.815 s … 74310.881 s
Runs: 4
Summary
COMPILER=gcc COMMIT=caa68f79c11e5c444977ce8dee8a43020b7b3c5a
ran 1.12 ± 0.17 times faster than
COMPILER=gcc COMMIT=5acf12bafeb126f2190b3f401f95199e0eea90c9
Bitcoin Node Setup
CPU: Intel® Core™ i5-6500T @ 2.50GHz
RAM: 16 GB
Storage: 1TB SSD (Apacer AS350)
Internet:
Download: 732.86 Mbps
Upload: 322.55 Mbps
Database Cache Adjustments
Initially, the provided database cache (DBCACHE) size was too high for the available memory on this node, causing fluctuations in execution time. To account for this variability, four runs per commit were conducted. Several configurations were tested:
DBCACHE=30000: Execution halted.
DBCACHE=10000: Execution was too slow (probably because of swapping) without a significant difference between benchmarks.
DBCACHE=5000: Found to be the optimal configuration for this node.
l0rinc force-pushed
on Mar 11, 2025
l0rinc renamed this:
optimization: batch XOR operations 12% faster IBD
[IBD] multi-byte block obfuscation
on Mar 12, 2025
l0rinc marked this as ready for review
on Mar 12, 2025
DrahtBot added the label
Needs rebase
on Mar 20, 2025
l0rinc force-pushed
on Mar 20, 2025
DrahtBot removed the label
Needs rebase
on Mar 20, 2025
l0rinc force-pushed
on Mar 20, 2025
DrahtBot added the label
CI failed
on Apr 2, 2025
l0rinc force-pushed
on Apr 3, 2025
DrahtBot removed the label
CI failed
on Apr 4, 2025
l0rinc force-pushed
on Apr 5, 2025
l0rinc
commented at 5:30 pm on April 5, 2025:
contributor
As mentioned before, this PR only contains the multi-byte obfuscation now, batching the single-byte serializations is done in #31551
The PR achieves 18x faster obfuscation on Linux, 49x faster on Mac and an IBD speedup of 4%.
The latest push:
Changed the existing Xor benchmark to XorObfuscationBench, measuring a 10 MiB chunk of random memory;
dbwrapper.cpp and mempool_persist.cpp migrations are simplified by dedicated refactor commits;
streams_tests.cpp now uses the native m_rng, comments were added to the new big tests, unified key_bytes{"ff00ff00ff00ff00"_hex_v} name and storage;
Updated every measurement and commit message to reflect the current state after the split.
Thanks for the reviews so far, it’s ready for review again!
Edit: rebased in latest push to resolve CI failure
l0rinc force-pushed
on Apr 6, 2025
achow101 referenced this in commit
33df4aebae
on Apr 16, 2025
DrahtBot added the label
Needs rebase
on Apr 16, 2025
l0rinc force-pushed
on Apr 17, 2025
l0rinc
commented at 10:10 am on April 17, 2025:
contributor
Rebased, now that #31551 was merged - will redo the IBD benchmarks (since we have bigger obfuscatable chunks now) to see if any of the commit messages or descriptions need changing.
The PR is otherwise ready for review again!
DrahtBot removed the label
Needs rebase
on Apr 17, 2025
Do you feel a need to assert(m_obfuscation == 0) at the beginning of the block because the ctor has ~30 preceding lines? Can’t think of another reason.
assert(m_obfuscation == 0) is needed to document (and to make sure) that the obfuscation key isn’t itself written with obfuscation turned on, i.e. we have to make sure the obfuscation key is 0 to turn it off for this write only.
So without assert(m_obfuscation == 0) we might not read old stored obfuscation keys back correctly. Does that explain it? If not, let’s have a call.
hodlinator
commented at 10:14 am on April 24, 2025:
Ah, I was interpreting “Needed for unobfuscated Read” as applying to Read-calls after the ctor, hadn’t realized it was for the call 2 lines below. :man_facepalming: That’s why I didn’t think it was relevant to include in #31144 (review). Sorry for this detour. assert is good.
Maybe comment could spell out “Needed for unobfuscated Read directly below” for similar readers to me, but hopefully there aren’t too many of them.
No problem, glad it’s sorted. Extended the comment to make it even clearer.
in
src/dbwrapper.h:190
in
46854038e7outdated
@@ -188,16 +187,11 @@ class CDBWrapper
     std::string m_name;

     //! a key used for optional XOR-obfuscation of the database
-    std::vector<unsigned char> obfuscate_key;
+    Obfuscation m_obfuscation;
hodlinator
commented at 7:43 am on April 22, 2025:
hodlinator
commented at 9:00 am on April 22, 2025:
Commit message in 118d8083b913e130b073ed3ad0eeb5aca4887899:
-This commit inlines the obfuscate‑key initialization, replaces `key_exists` with `key_missing`, and simplifies the `if` condition that writes a new obfuscation key.
-The `CreateObfuscateKey` method and its private helper are removed.
+This commit inlines the `CreateObfuscateKey` method, replaces `key_exists` with `key_missing`, and simplifies the `if` condition that writes a new obfuscation key.
This private method is just reformatted. Maybe a case of moving changes between commits. Seems unnecessary.
“The CreateObfuscateKey method is inlined” might be more precise?
hodlinator
commented at 1:43 pm on April 23, 2025:
Seems you forgot to remove “The CreateObfuscateKey method and its private helper were also removed.”?
I think saying CreateObfuscateKey was inlined is sufficient, and no private helper was removed, as stated above in (1).
Taking only fixed-size spans in the interface of Obfuscation() encourages call-sites to perform error checking, or use std::array which implicitly converts if correctly sized - correct by construction. span::first() is unchecked as I said, so the current interface is unsafe as there are no checks.
I consider this thread unresolved in the current version of the PR.
I admit defeat.
Edit:
Initialized it in the tests similarly to InitBlocksdirXorKey via static extent arrays instead of vectors. This is closer to prod usage, so it’s even better
Edit: Superseded by #31144#pullrequestreview-2788618744
Didn’t get much deeper than surface level yet, but sharing what I found so far.
My primary suggestion is to change the Obfuscation-ctors to take static-extent spans to prevent accidental out-of-bounds access and also to clarify that we don’t consume bigger vectors (see inline comment). std::array of matching size maps well to such spans.
hodlinator
commented at 1:52 pm on April 22, 2025:
contributor
Changed the serialization of Obfuscation from vector -> array without re-testing my suggestions towards the end. Forgot that one serializes the size and the other does not. Seems to be responsible for some of the test failures on my suggestion-branch.
If you have a suggestion that passes ci (and local IBD for some blocks), let me know
How about CI above + local IBD on mainnet of a couple of months of blocks?
Have you ever encountered IBD failures for code that passes CI? Seems like missing test coverage in that case?
Scaled back some of the vector -> array, especially in context of serialization.
l0rinc
commented at 9:43 am on April 24, 2025:
contributor
Have you ever encountered IBD failures for code that passes CI? Seems like missing test coverage in that case?
yes, extended one of the tests now, hoping that will cover it next time.
then chided me in public
Definitely wasn’t my intention to scold you in any way, just didn’t (and still don’t) understand what you’re objecting to or suggesting in that part of the code. Pushed some changes, if it’s still not clear, let’s discuss in person.
DrahtBot added the label
CI failed
on Apr 24, 2025
DrahtBot
commented at 11:06 am on April 24, 2025:
contributor
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
l0rinc force-pushed
on Apr 24, 2025
l0rinc force-pushed
on Apr 24, 2025
l0rinc
commented at 7:06 pm on April 24, 2025:
contributor
Thanks for the reviews, addressed most of your concerns - except for the vector constructor for Obfuscation - but if other reviewers also think it’s better that way, I’ll do it of course.
Also extended the BOOST_AUTO_TEST_CASE(dbwrapper) test case with asserting that the obfuscation key can be read back by an unobfuscated instance as well.
l0rinc force-pushed
on Apr 24, 2025
l0rinc force-pushed
on Apr 24, 2025
l0rinc force-pushed
on Apr 24, 2025
hodlinator
commented at 10:13 am on April 25, 2025:
contributor
Just letting the thread know that we’ve cleared up our miscommunication regarding #31144#pullrequestreview-2788618744. The combination of my prior suggestion failing CI and @l0rinc’s cautioning to make sure my further suggestions pass CI seem to be an unlucky coincidence within a very short time-frame. We should strive towards assuming good-faith when communicating over text, but on some days frustrations can get the best of me.
l0rinc force-pushed
on Apr 25, 2025
l0rinc force-pushed
on Apr 25, 2025
l0rinc
commented at 11:34 am on April 25, 2025:
contributor
Sorry for a few useless pushes, I was fighting some compilers for strings not always being constexpr-able…
Code is ready for review again.
DrahtBot
commented at 1:26 pm on April 25, 2025:
contributor
[07:30:05.813] ⚠️ Failure generated from target with exit code 77: ['/ci_container_base/ci/scratch/build-x86_64-pc-linux-gnu/bin/fuzz', '-runs=1', PosixPath('/ci_container_base/ci/scratch/qa-assets/fuzz_corpora/validation_load_mempool')]
l0rinc force-pushed
on Apr 25, 2025
l0rinc
commented at 5:04 pm on April 25, 2025:
contributor
For now I’ve reverted the assert that checks the deserialized key size, since I couldn’t reproduce the fuzz failure locally (or when I could, even master failed for the same command).
Will continue trying, but without the assert it’s closer to the previous state, so it should be fine as it is as well.
DrahtBot removed the label
CI failed
on Apr 25, 2025
l0rinc force-pushed
on Apr 26, 2025
l0rinc
commented at 2:57 pm on April 26, 2025:
contributor
Restored the deserialization validation in Obfuscation::Unserialize; we can’t assert, but we can throw a std::logic_error, since during mempool fuzzing https://github.com/bitcoin/bitcoin/blob/master/src/node/mempool_persist.cpp#L141 catches and ignores these errors safely (managed to reproduce it on Linux, not sure why it’s not reproducible on Mac).
I’ve also split out all renames to a single commit before any other refactor or optimization to simplify the higher-risk changes.
PR is ready for review again!
in
src/dbwrapper.h:206
in
bba64732ffoutdated
199@@ -210,6 +200,11 @@ class CDBWrapper
200 auto& DBContext() const LIFETIMEBOUND { return *Assert(m_db_context); }
201202 public:
203+ // Prefixed with null character to avoid collisions with other keys
204+ //
205+ // We must use a string constructor which specifies length so that we copy past the null-terminator.
206+ inline static const std::string OBFUSCATION_KEY{"\000obfuscate_key", 14};
hodlinator
commented at 1:27 pm on April 30, 2025:
What is the point of changing this?
Renaming OBFUSCATE_KEY_KEY -> OBFUSCATION_KEY makes the first word nicer, but it is arguably the database key (1) for looking up the key (2) used to obfuscate (0) the data.
Making it inline might be shifting some work from the linker to the compiler (better for parallelism?). Having it all in the header is slightly nicer for humans, but seems like it would duplicate the constant where it is used. Could it just be declared as a file-local (static/anon namespace + constexpr) variable in the .CPP file instead?
If we keep this change, could it be made private constexpr?
Some compilers had problem with string constexpr, it’s why I inlined it instead.
I don’t mind reverting to OBFUSCATION_KEY_KEY (note OBFUSCATE -> OBFUSCATION), if you think it’s better.
Why change it from being private to public, some later PR?
nit: This compiles on GCC 14.2.1 and Clang 20.1.3:
diff --git a/src/dbwrapper.cpp b/src/dbwrapper.cpp
index ffb25b8ac1..ffe98a3fde 100644
--- a/src/dbwrapper.cpp
+++ b/src/dbwrapper.cpp
@@ -32,6 +32,11 @@
 #include <optional>
 #include <utility>

+// Prefixed with null character to avoid collisions with other keys
+//
+// We must use a string constructor which specifies length so that we copy past the null-terminator.
+static constexpr std::string OBFUSCATION_KEY_KEY{"\000obfuscate_key", 14};
+
 static auto CharCast(const std::byte* data) { return reinterpret_cast<const char*>(data); }

 bool DestroyDB(const std::string& path_str)
diff --git a/src/dbwrapper.h b/src/dbwrapper.h
index 7a027d2ce4..8e9c6a31ba 100644
--- a/src/dbwrapper.h
+++ b/src/dbwrapper.h
@@ -200,11 +200,6 @@ private:
     auto& DBContext() const LIFETIMEBOUND { return *Assert(m_db_context); }

 public:
-    // Prefixed with null character to avoid collisions with other keys
-    //
-    // We must use a string constructor which specifies length so that we copy past the null-terminator.
-    inline static const std::string OBFUSCATION_KEY_KEY{"\000obfuscate_key", 14};
-
     CDBWrapper(const DBParams& params);
     ~CDBWrapper();
nit in 366bffd1252a768e1161c7b632ef8c4816bb504e:
Could drop rename here from Xor -> XorObfuscationBench since next commit renames it again to ObfuscationBench.
Was done in the next commit, indeed - localized it, thanks for finding these rebase inconsistencies (likely happened when I changed the order of commits).
in
src/dbwrapper.cpp:253
in
bba64732ffoutdated
@@ -248,24 +248,24 @@ CDBWrapper::CDBWrapper(const DBParams& params)
         LogPrintf("Finished database compaction of %s\n", fs::PathToString(params.path));
     }

-    // The base-case obfuscation key, which is a noop.
-    obfuscate_key = std::vector<unsigned char>(OBFUSCATE_KEY_NUM_BYTES, '\000');
+    {
+        assert(m_obfuscation == 0); // Needed for unobfuscated Read() below
+        std::vector<uint8_t> obfuscation_key_vector(Obfuscation::SIZE_BYTES, '\000');
In the !key_missing case, we’ve just read the vector from a file on disk. It could have been corrupted and got a wildly unexpected length. If it is too short we will read out of bounds here, if it is too long we will continue when we probably should be erroring out. See #31144 (review)
Valid critique - which is a gateway-suggestion forcing me to admit that we need static extent spans/arrays here :/
in
src/streams.h:250
in
bba64732ffoutdated
-    /**
-     * XOR the contents of this stream with a certain key.
-     *
-     * @param[in] key The key used to XOR the data in this stream.
-     */
-    void Xor(const std::vector<unsigned char>& key)
+    void Obfuscate(const Obfuscation& obfuscation)
Should we really continue after reading an invalid key length? Feels like it would be good to throw an exception, or assert the condition instead of if.
(Could adjust the message/comment to include the possibility of us being old software running on some future data format we don’t recognize?)
I’m against throwing in general - especially for constructors, but we’re already reading files in there and already throwing via HandleError - changed and moved to the related refactoring commit.
in
src/node/mempool_persist.cpp:65
in
b085c52030outdated
note in b085c52030ec30294f68abfdbe5b05acbae60a7e:
Seems like there’s a newish footgun with curly-brace initialization of vectors.
This compiles because the passed value unambiguously specifies the length (std::byte isn’t implicitly constructible from integer literals):
std::vector<std::byte> obfuscation{512};
The following fails, because it’s interpreted as initializer-list construction, and the integer literal would be narrowed to fit into the first element of the vector:
std::vector<uint8_t> obfuscation{512};
So your code is correct, but I’m actually starting to feel some anti-curly sentiment brewing for vector initialization.
I don’t think so, if this isn’t true, we’re in trouble - so it has to be the case in prod as well.
The Godbolt reproducer indicates this line will be eliminated in prod if all calls are demonstrably so already.
I think I had a bug here at one point so added this line to make sure it’s easily detectable.
Still, there may be a case for this change, as it is a private function and the input is well covered by tests. Assume() will abort on failure in BUILD_FOR_FUZZING configurations in release.
WIP: Changing this gave a 1-2% difference in benchmarks.
Re-testing now I don’t see a difference, however. Same amount of instructions/MiB on both GCC & Clang for assert/Assume.
Tried removing the check altogether and still get same amount of instructions/MiB on GCC, indicating it is able to infer that it will never fail and is without side-effects. So seems fine to keep!
in
src/obfuscation.h:52
in
989537ff40outdated
+    {
+        if (!*this) return;
+        const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
+        for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
+            Xor(target, rot_key, SIZE_BYTES);
+        }
One more thing - while I was testing this out, I couldn’t find a unit test that tested obfuscating a target buffer starting at different offsets in memory. Might be good to add one?
I’m not sure about the alignment part, seems dangerous - do you see an explanation in the assembly why it would be faster?
I tried reproducing it, but I’m getting a significant slowdown instead (ran each 3x):
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
    if (!*this) return;
    const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
    for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
        Xor(target, rot_key, SIZE_BYTES);
    }
    Xor(target, rot_key, target.size());
}
| ns/MiB | MiB/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 14,543.90 | 68,757.33 | 0.1% | 11.01 | ObfuscationBench |
| 14,604.52 | 68,471.95 | 0.0% | 11.01 | ObfuscationBench |
| 14,573.71 | 68,616.72 | 0.2% | 11.04 | ObfuscationBench |
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
    if (!*this) return;
    uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
    const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())};
    Xor(target, rot_key, alignment_remaining);
    target = target.subspan(alignment_remaining);
    rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES];
    for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
        *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;
    }
    Xor(target, rot_key, target.size());
}
| ns/MiB | MiB/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 19,552.98 | 51,143.11 | 0.1% | 11.01 | ObfuscationBench |
| 19,529.72 | 51,204.01 | 0.2% | 11.00 | ObfuscationBench |
| 20,046.53 | 49,883.94 | 0.1% | 11.03 | ObfuscationBench |
However, looking at the assembly more closely, I revisited why GCC doesn’t emit any SIMD, experimented with a few options, and it seems we can help GCC by doing a big unrolled iteration before the whole thing,
which helps GCC recognise that the loop is free of dependencies and perform a 128-bit SIMD pxor over 64-byte blocks instead of doing two 64-bit scalar xors per 16-byte stride - speeding up the benchmark from the mentioned ~19 GiB/s to ~27 GiB/s, i.e. an additional 40% speedup. (Clang is unaffected.)
I’ll do an IBD again to see if it’s significant enough to include it here.
do you see an explanation in the assembly why it would be faster?
Didn’t check the assembly yet, do you mind sharing your workflow for that? I was using a combination of objdump with a bunch of flags + searching in less for the consteval hex stuff ~9 months ago but suspect Obfuscation-methods get heavily inlined?
Re-measured with Clang to make sure I wasn’t running the wrong build, but can still reproduce the result:
At tip of PR 989537ff4026d7c3fa5ba99701e0a4b134d950f7:
Took me longer than anticipated, but I have measured different scenarios here to understand where alignment matters and where it doesn’t, with GCC, Clang and Apple Clang.
GCC 14 is 38.5% faster with alignment & 64 byte unroll compared to just a 64 bit xor
00d77f4097 refactor: unify xor vs obfuscation nomenclature
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 926,379.31 | 1,079.47 | 0.1% | 6,815,747.36 | 3,325,871.40 | 2.049 | 524,289.23 | 0.0% | 5.50 | ObfuscationBench |
| 926,608.63 | 1,079.20 | 0.0% | 6,815,747.36 | 3,326,572.88 | 2.049 | 524,289.23 | 0.0% | 5.50 | ObfuscationBench |
2daaa83070 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 74,817.84 | 13,365.80 | 0.0% | 655,366.68 | 268,566.88 | 2.440 | 131,074.08 | 0.0% | 5.50 | ObfuscationBench |
| 75,351.98 | 13,271.05 | 0.1% | 655,366.68 | 270,405.84 | 2.424 | 131,074.08 | 0.0% | 5.51 | ObfuscationBench |
c69f7440db experiment: aligned xor only
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 38,396.57 | 26,043.99 | 0.1% | 294,929.64 | 137,732.80 | 2.141 | 32,771.94 | 0.0% | 5.35 | ObfuscationBench |
| 38,151.47 | 26,211.31 | 0.2% | 294,929.64 | 136,877.12 | 2.155 | 32,771.94 | 0.0% | 5.38 | ObfuscationBench |
9472af65af experiment: unroll only
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 35,556.53 | 28,124.23 | 0.1% | 262,148.44 | 127,610.00 | 2.054 | 16,384.64 | 0.0% | 5.32 | ObfuscationBench |
| 36,409.31 | 27,465.50 | 0.1% | 262,148.44 | 130,650.29 | 2.006 | 16,384.64 | 0.0% | 5.33 | ObfuscationBench |
e9fd88bb3b experiment: unroll + alignment
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 34,770.15 | 28,760.30 | 0.0% | 262,149.44 | 124,756.60 | 2.101 | 16,384.83 | 0.0% | 5.32 | ObfuscationBench |
| 36,214.90 | 27,612.94 | 0.0% | 262,149.44 | 129,932.72 | 2.018 | 16,384.84 | 0.0% | 5.33 | ObfuscationBench |
Combining unrolling with aligning readjusts the speed for Apple Clang (which was already doing these automatically so there’s no speed difference because of the latest commit) and brings Clang and GCC closer to the max bus speed.
Clang needed your alignment to achieve that, and GCC needed the unrolled 64-byte XOR to take advantage of SIMD.
Added this as a separate commit with you as co-author, explaining the extra optimizations, and updated the Godbolt listing and the table from:
| Target & Compiler | Stride (per iter) | Main operation(s) in loop | Effective XORs |
|:---|:---|:---|:---|
| Clang x86-64 (trunk) | 32 bytes | two unaligned 128-bit loads → pxor with broadcast key → stores | 4 × 64-bit |
| GCC x86-64 (trunk) | 16 bytes | two scalar 64-bit xor reg-mem instructions directly on the target | 2 × 64-bit |
| GCC RV32 (trunk) | 8 bytes | copy 8 B to stack scratch → XOR via two 32-bit regs → copy back | 1 × 64-bit |
| GCC s390x (big-endian 14.2) | 64 bytes | eight XC ops using key cached on stack | 8 × 64-bit |
to:
| Target & Compiler | Stride (per hot-loop iter) | Main operation(s) in loop | Effective XORs / iter |
|:---|:---|:---|:---|
| Clang x86-64 (trunk) | 64 bytes | 4 × movdqu → pxor → store (aligned after peel) | 8 × 64-bit |
| GCC x86-64 (trunk) | 64 bytes | same 4 × movdqu/pxor sequence, enabled by 8-way unroll | 8 × 64-bit |
| GCC RV32 (trunk) | 8 bytes | copy 8 B to temp stack → 2 × 32-bit XOR → copy back | 1 × 64-bit (as 2 × 32-bit) |
| GCC s390x (big-endian 14.2) | 64 bytes | 8 × XC (mem-mem 8-B XOR) with key cached on stack | 8 × 64-bit |
There was a build failure on 32 bit systems because of the alignment changes - fixed it by moving the std::assume_aligned check inside the alignment condition itself.
Remeasured the benchmarks (all 3 compilers are the same as in the previous push), redid the endianness check (as described in the PR description) and updated the Godbolt listing at https://godbolt.org/z/35nveanf5 again.
hodlinator approved
hodlinator
commented at 8:25 am on May 20, 2025:
contributor
ACK 989537ff4026d7c3fa5ba99701e0a4b134d950f7
Optimizing by operating on whole-word units instead of byte-for-byte makes for more efficient use of common hardware. Extracting the optimization into the Obfuscation abstraction which caches whole-word “rotations” (the XOR-key pre-rotated at all possible byte-offsets) makes sense.
Clear speedup in benchmarks. Even if the least improved of those measured (ReadBlockBench) would be the most representative of real-world use, it is sped up by ~15-16%.
Benchmarks
ObfuscationBench: 12x speedup on Clang, 17x speedup on GCC.
ReadBlockBench: 15-16% speedup on both.
ReadRawBlockBench: ~9x speedup on both.
Used ./build/bin/bench_bitcoin --filter="ObfuscationBench|ReadBlockBench|ReadRawBlockBench" --min-time=10000 on NixOS.
GCC 14.2.1
Before (06353e23e19a116a8e1862bef4b1fcfece14b445, 2nd commit in PR):
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 861,980.77 | 1,160.12 | 0.2% | 9,437,186.38 | 3,177,041.69 | 2.970 | 1,048,576.97 | 0.0% | 10.79 | ObfuscationBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 5,509,795.20 | 181.49 | 0.5% | 64,457,565.88 | 20,006,289.95 | 3.222 | 3,636,177.14 | 0.3% | 11.03 | ReadBlockBench |
| 936,745.22 | 1,067.53 | 0.3% | 9,010,771.90 | 3,232,301.21 | 2.788 | 1,002,214.88 | 0.0% | 10.99 | ReadRawBlockBench |
After (989537ff4026d7c3fa5ba99701e0a4b134d950f7, last commit in PR):
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 51,720.05 | 19,334.86 | 0.4% | 327,684.05 | 190,464.16 | 1.720 | 65,536.55 | 0.0% | 10.99 | ObfuscationBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 4,675,373.37 | 213.89 | 0.3% | 55,774,507.70 | 16,981,281.76 | 3.284 | 2,699,256.61 | 0.5% | 10.98 | ReadBlockBench |
| 105,697.36 | 9,460.97 | 0.2% | 324,342.05 | 175,063.45 | 1.853 | 64,840.05 | 0.0% | 10.99 | ReadRawBlockBench |
Clang 20.1.3
Before (06353e23e19a116a8e1862bef4b1fcfece14b445, 2nd commit in PR):
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 907,086.11 | 1,102.43 | 0.1% | 6,815,747.33 | 3,342,513.78 | 2.039 | 524,289.21 | 0.0% | 11.01 | ObfuscationBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 5,509,068.75 | 181.52 | 0.1% | 61,830,240.13 | 19,977,540.00 | 3.095 | 3,020,308.24 | 0.4% | 10.98 | ReadBlockBench |
| 924,286.56 | 1,081.92 | 0.3% | 6,511,639.90 | 3,190,509.11 | 2.041 | 502,479.88 | 0.0% | 10.97 | ReadRawBlockBench |
After (989537ff4026d7c3fa5ba99701e0a4b134d950f7, last commit in PR):
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 75,454.29 | 13,253.06 | 0.6% | 655,366.68 | 277,944.85 | 2.358 | 131,074.08 | 0.0% | 10.71 | ObfuscationBench |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 4,632,323.51 | 215.87 | 0.1% | 55,587,644.66 | 16,818,777.18 | 3.305 | 2,552,825.55 | 0.4% | 11.00 | ReadBlockBench |
| 105,903.19 | 9,442.59 | 0.1% | 262,149.05 | 176,010.62 | 1.489 | 33,762.05 | 0.1% | 11.00 | ReadRawBlockBench |
test: compare util::Xor with randomized inputs against simple impl
Since production code only uses keys of length 8, we're not testing with other values anymore
Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
f6689a7d9f
bench: make ObfuscationBench more representative
Since a previous PR already solved the tiny byte-array xors during serialization, we're only concentrating on big continuous chunks now.
> C++ compiler .......................... GNU 14.2.0
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 879,957.05 | 1,136.42 | 0.0% | 9,437,186.42 | 3,158,477.41 | 2.988 | 1,048,576.99 | 0.0% | 5.50 | `ObfuscationBench`
> C++ compiler .......................... Clang 20.1.6
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 926,379.31 | 1,079.47 | 0.1% | 6,815,747.36 | 3,325,871.40 | 2.049 | 524,289.23 | 0.0% | 5.50 | `ObfuscationBench`
be3435c2ff
refactor: unify xor-vs-obfuscation nomenclature
This simplifies the next commits.
2c8cb56fdd
refactor: prepare `DBWrapper` for obfuscation key change
Since `CDBWrapper::Read` still supports vector only, we can initialize `m_obfuscation` directly instead of using a separate helper.
`CreateObfuscation` was also inlined, replaced `key_exists` with `key_missing`, and simplified the `if` condition that writes a new obfuscation key.
Since we're already throwing via `HandleError` here, the obfuscation key is also validated by throwing if invalid.
1b3faf9788
refactor: prepare mempool_persist for obfuscation key change
f79e8b9a35
l0rinc force-pushed
on May 23, 2025
DrahtBot added the label
CI failed
on May 23, 2025
DrahtBot
commented at 2:15 pm on May 23, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task MSan, depends: https://github.com/bitcoin/bitcoin/runs/42787791556
LLM reason (✨ experimental): The CI failure is due to a failed assertion related to memory alignment during streams_tests.
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t`
Since `util::Xor` now operates on `uint64_t` keys, migrated the obfuscation key end‑to‑end - from use sites in util::Xor up through streams, LevelDB wrappers, and disk (de)serialization - replacing all former `std::vector<std::byte>` keys with `uint64_t` (we still serialize them as vectors but convert immediately to `uint64_t` on load). This is why tests still generate vector keys and convert them to `uint64_t` later instead of generating them directly.
We also short‑circuit `Xor` calls when the key is zero to avoid unnecessary calculations (e.g., `MakeWritableByteSpan`).
In `Obfuscation::Unserialize` we can safely throw a `std::logic_error`, since during mempool fuzzing `mempool_persist.cpp#L141` catches and ignores these errors safely.
> C++ compiler .......................... GNU 14.2.0
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 51,665.69 | 19,355.21 | 0.0% | 327,684.05 | 185,409.22 | 1.767 | 65,536.55 | 0.0% | 5.50 | `ObfuscationBench`
> C++ compiler .......................... Clang 20.1.6
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 74,817.84 | 13,365.80 | 0.0% | 655,366.68 | 268,566.88 | 2.440 | 131,074.08 | 0.0% | 5.50 | `ObfuscationBench`
Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
3959a90106
optimization: peel-align head and unroll body to 64 bytes
Benchmarks indicated that processing 8-byte words instead of bytes already gives an order of magnitude speed-up, but:
* GCC still emitted scalar code;
* Clang’s auto-vectorised loop ran on the slow unaligned-load path.
The fix consists of:
* peeling the mis-aligned head enabled the hot loop starting at an 8-byte address;
* `std::assume_aligned<8>` tells the optimiser the promise holds - required to keep Apple Clang happy;
* manually unrolling the body to 64 bytes enabled GCC to auto-vectorise.
> C++ compiler .......................... GNU 14.2.0
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 37,281.01 | 26,823.31 | 0.1% | 475,139.64 | 133,808.24 | 3.551 | 81,920.54 | 0.0% | 5.35 | `ObfuscationBench`
> C++ compiler .......................... Clang 20.1.6
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 34,770.15 | 28,760.30 | 0.0% | 262,149.44 | 124,756.60 | 2.101 | 16,384.83 | 0.0% | 5.32 | `ObfuscationBench`
Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
63854e8a81
l0rinc force-pushed
on May 23, 2025
DrahtBot removed the label
CI failed
on May 23, 2025
hodlinator approved
hodlinator
commented at 9:17 am on May 28, 2025:
contributor
re-ACK 63854e8a81fdef3ca99ebd339db72563d053b9d0
While the code is made less straightforward due to optimizations, I think it’s still worth doing it to minimize overhead from obfuscation.
Adapts code to help compiler with loop unrolling and SIMD.
Instead of having *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key, the unrolling allows for using Obfuscation::Xor() with memcpys back & forth and still have compilers optimize effectively.
I like how this version calculates the number of initial non-aligned bytes.
std::assume_aligned - nice find!
Testing
Guix + Windows
Was curious about Windows performance, so I patched the Guix build to include bench_bitcoin.exe.
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2025-06-03 09:12 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me