optimization: batch XOR operations 12% faster IBD #31144

pull l0rinc wants to merge 11 commits into bitcoin:master from l0rinc:l0rinc/optimize-xor changing 18 files +461 −302
  1. l0rinc commented at 7:11 am on October 24, 2024: contributor

    Depends on #31551.

    The description and stats will be updated once the dependent PRs are merged, but so far it seems that the 3 PRs combined can achieve a 12% IBD speedup.


    The recently merged XOR obfuscation introduced a minor serialization cost (seen during IBD benchmarking). This PR is meant to alleviate that.

    Summary

    The obfuscation’s effect on IBD seems to be about ~4% when it still has to be applied, and ~9% when it’s turned off.

    Changes in testing, benchmarking and implementation

    • Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
    • Updated the benchmark to reflect realistic scenarios by capturing every call of util::Xor for the first 860k blocks, creating a model with the same data distribution. An additional benchmark checks the effect of short-circuiting XOR when the key is zero, ensuring no speed regression occurs when the obfuscation feature is disabled.
    • Optimized the Xor function to process in batches (64/32/16/8 bits instead of per-byte).
    • Migrated remaining std::vector<std::byte>(8) values to uint64_t.

    Reproducer and assembly

    Memory alignment is handled via std::memcpy, which is optimized out on the tested platforms (see https://godbolt.org/z/GGYcedjzY); a simplified sketch follows the list below:

    • Clang (x86-64) - 32 bytes/iter using SSE vector operations
    • GCC (x86-64) - 16 bytes/iter using unrolled 64-bit XORs
    • RISC-V (32-bit) - 8 bytes/iter using load/XOR/store sequence
    • s390x (big-endian) - 64 bytes/iter with unrolled 8-byte XORs
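
    A minimal sketch of the chunked approach (XorSketch is an illustrative name; the sketch assumes the key stream starts at offset 0, while the actual patch also rotates the key for non-zero offsets and specializes small sizes):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <span>

    void XorSketch(std::span<std::byte> write, uint64_t key)
    {
        if (key == 0) return; // disabled obfuscation short-circuits to a no-op

        size_t i{0};
        for (; i + 8 <= write.size(); i += 8) { // bulk of the work: 8 bytes per step
            uint64_t chunk;
            std::memcpy(&chunk, write.data() + i, 8); // alignment-safe load
            chunk ^= key;                             // bitwise, so byte order doesn't matter here
            std::memcpy(write.data() + i, &chunk, 8); // alignment-safe store
        }

        std::byte key_bytes[8];
        std::memcpy(key_bytes, &key, 8); // the key's in-memory byte order
        for (; i < write.size(); ++i) {  // at most 7 trailing bytes
            write[i] ^= key_bytes[i % 8];
        }
    }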

    Endianness

    The only endianness issue was with the bit rotation intended to realign the key when obfuscation halted before consuming the full key (see the sketch after the commands below). Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order. Since CI lacks a big-endian machine, testing was done locally via Docker:

    brew install podman pigz
    softwareupdate --install-rosetta
    podman machine init
    podman machine start
    docker run --platform linux/s390x -it ubuntu:latest /bin/bash
        apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev && \
        cd /mnt && git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
        cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
        ./build/src/test/test_bitcoin --run_test=streams_tests
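
    To illustrate why the rotation direction is endian-sensitive (a sketch; RotateKey is a hypothetical name, and the actual fix is quoted in a later comment):

    #include <bit>
    #include <cstddef>
    #include <cstdint>

    // The 8-byte key is stored in native byte order inside a uint64_t.
    // Advancing the key stream by key_offset bytes therefore means rotating
    // that integer: on little-endian the first key byte is the least
    // significant one, so a one-byte advance is a right-rotation by 8 bits;
    // on big-endian it is the most significant byte, so the direction flips.
    uint64_t RotateKey(uint64_t key, size_t key_offset)
    {
        const int bits{static_cast<int>(8 * (key_offset % 8))};
        if constexpr (std::endian::native == std::endian::big) {
            return std::rotl(key, bits);
        } else {
            return std::rotr(key, bits);
        }
    }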
    

    Performance

    cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    

    The 860k block profile contains a lot of very big arrays (96'233 distinct sizes, the biggest being 3'992'470 bytes long) - a big departure from the previous 400k and 700k block profiles (1500 distinct sizes, the biggest being 9319 bytes long).

    The performance characteristics are also quite different, now that we have more and bigger byte arrays:

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    ns/byte byte/s err% total benchmark
    1.00 1,000,913,427.27 0.7% 10.20 AutoFileXor
    0.85 1,173,442,964.60 0.2% 11.16 XorHistogram

    After:

    ns/byte byte/s err% total benchmark
    0.09 11,204,183,007.86 0.6% 11.08 AutoFileXor
    0.15 6,459,482,269.06 0.3% 10.97 XorHistogram

    i.e. ~11/5.5x (disabled/enabled) faster with Clang at processing the data with representative histograms.

    C++ compiler …………………….. GNU 13.2.0

    Before:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    1.87 535,253,389.72 0.0% 9.20 3.45 2.669 1.03 0.1% 11.02 AutoFileXor
    1.70 587,844,715.57 0.0% 9.35 5.41 1.729 1.05 1.7% 10.95 XorHistogram

    After:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.59 1,706,433,032.76 0.1% 0.00 0.00 0.620 0.00 1.8% 11.01 AutoFileXor
    0.47 2,145,375,849.71 0.0% 0.95 1.48 0.642 0.20 9.6% 10.93 XorHistogram

    i.e. ~3.2/3.5x faster (disabled/enabled) with GCC at processing the data with representative histograms.

    Before:

    ns/op op/s err% total benchmark
    2,237,168.64 446.99 0.3% 10.91 ReadBlockFromDiskTest
    748,837.59 1,335.40 0.2% 10.68 ReadRawBlockFromDiskTest

    After:

    ns/op op/s err% total benchmark
    1,827,436.12 547.21 0.7% 10.95 ReadBlockFromDiskTest
    49,276.48 20,293.66 0.2% 10.99 ReadRawBlockFromDiskTest

    Also visible on https://corecheck.dev/bitcoin/bitcoin/pulls/31144

  2. DrahtBot commented at 7:11 am on October 24, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31144.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK hodlinator, ryanofsky

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31551 (optimization: bulk reads(27%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD by l0rinc)
    • #31539 (optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD by l0rinc)
    • #31519 (refactor: Use std::span over Span by maflcko)
    • #30965 (kernel: Move block tree db open to block manager by TheCharlatan)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
    • #29307 (util: explicitly close all AutoFiles that have been written by vasild)
    • #27006 (reduce cs_main scope, guard block index ’nFile’ under a local mutex by furszy)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. l0rinc force-pushed on Oct 24, 2024
  4. DrahtBot added the label CI failed on Oct 24, 2024
  5. in src/streams.h:40 in 10b9e68768 outdated
    41-    key_offset %= key.size();
    42 
    43-    for (size_t i = 0, j = key_offset; i != write.size(); i++) {
    44-        write[i] ^= key[j++];
    45+    if (size_t remaining = write.size() - i; key.size() == 8 && remaining >= 8) { // Xor in 64-bit chunks
    46+        const auto key64 = *std::bit_cast<uint64_t*>(key.data());
    


    maflcko commented at 8:27 am on October 24, 2024:
    I fail to see how this is not UB. This is identical to #30349 (review)

    maflcko commented at 8:54 am on October 24, 2024:

    Isn’t this what we’re doing in CScript as well https://github.com/bitcoin/bitcoin/blob/master/src/script/script.h#L496 ?

    no? value_type is unsigned char (an 8-bit integer type) and this one here is uint64_t (a 64-bit integer type).


    l0rinc commented at 8:55 am on October 24, 2024:
    Yes, just saw it, my mistake

    l0rinc commented at 9:17 am on October 24, 2024:
    Replaced it with memcpy and it looks like the compiler successfully simplifies it to proper vector instructions: https://godbolt.org/z/Koscjconz
  6. in src/streams.h:42 in 10b9e68768 outdated
    43-    for (size_t i = 0, j = key_offset; i != write.size(); i++) {
    44-        write[i] ^= key[j++];
    45+    if (size_t remaining = write.size() - i; key.size() == 8 && remaining >= 8) { // Xor in 64-bit chunks
    46+        const auto key64 = *std::bit_cast<uint64_t*>(key.data());
    47+        const auto size64 = remaining / 8;
    48+        for (auto& write64 : std::span(std::bit_cast<uint64_t*>(write.data() + i), size64)) {
    


    maflcko commented at 8:27 am on October 24, 2024:
    same

    maflcko commented at 8:37 am on October 24, 2024:

    CI seems to agree:

    /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_iterator.h:1100:16: runtime error: reference binding to misaligned address 0x7f10961d9084 for type 'unsigned long', which requires 8 byte alignment
    0x7f10961d9084: note: pointer points here
      94 8e 20 eb 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                  ^ 
        #0 0x55780d85ab85 in __gnu_cxx::__normal_iterator<unsigned long*, std::span<unsigned long, 18446744073709551615ul>>::operator*() const /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_iterator.h:1100:9
        #1 0x55780d85ab85 in util::Xor(Span<std::byte>, Span<std::byte const>, unsigned long) ci/scratch/build-x86_64-pc-linux-gnu/src/test/./src/streams.h:42:28
    

    https://github.com/bitcoin/bitcoin/actions/runs/11495063168/job/31993791381?pr=31144#step:7:4647


    l0rinc commented at 8:44 am on October 24, 2024:
    Thanks, I’ll investigate. I assumed there would be more to check; that’s why it’s still a draft.
  7. maflcko commented at 8:27 am on October 24, 2024: member

    I think your example may be a bit skewed? It shows how much time is spent when deserializing a CScript from a block file. However, block files contain full blocks, where many (most?) of the writes are single bytes (or 4 bytes), see #30833 (comment). Thus, it would be useful to know what the overall end-to-end performance difference is. Also taking into account the utxo db.

    If you want the micro-benchmark to be representative, I’d presume you’d have to mimic the histogram of the sizes of writes. Just picking one (1024, or 1004), which is never hit in reality, and then optimizing for that may be misleading.

  8. l0rinc force-pushed on Oct 24, 2024
  9. l0rinc force-pushed on Oct 24, 2024
  10. l0rinc commented at 10:02 am on October 24, 2024: contributor

    where many (most?) of the writes are single bytes (or 4 bytes)

    Thanks, I’ve extended your previous benchmarks with both Autofile serialization and very small vectors. I will also run a reindex of 400k blocks before and after to see if the effect is measurable or not.

  11. in src/streams.h:44 in 6ae466bf11 outdated
    42+    if (key.size() == 8 && write.size() - i >= 8) { // Xor in 64-bit chunks
    43+        uint64_t key64;
    44+        std::memcpy(&key64, key.data(), 8);
    45+        for (; i <= write.size() - 8; i += 8) {
    46+            uint64_t write64;
    47+            std::memcpy(&write64, write.data() + i, 8);
    


    laanwj commented at 10:41 am on October 24, 2024:

    I have a hard time believing this will make a large difference, especially with the two memcpys involved. On modern CPUs, ALU operations (especially bitwise ones) are so fast compared to any kind of memory access. And this isn’t some advanced crypto math, it’s one XOR operation per word with a fixed key.

    Could avoid the memcpys if the code takes memory alignment into account, but that makes it even more complex. Not sure the pros/cons work out here.


    l0rinc commented at 10:46 am on October 24, 2024:
    The speedup comes from the vectorized operations, i.e. doing 64 bit xor instead of byte-by-byte xor (memcpy seems to be eliminated on 64 bit architectures successfully), see https://godbolt.org/z/Koscjconz

    laanwj commented at 11:05 am on October 24, 2024:
    Right, that’s trivial for x86_64. Let’s also check for architectures that do require alignment for 64-bit reads and writes, like RISC-V.

    l0rinc commented at 12:06 pm on October 24, 2024:
    Added RISC-V compiler to https://godbolt.org/z/n5rMeYeas - where it seems to my untrained eyes that it uses two separate 32 bit xors to emulate the 64 bit operation (but even if it’s byte-by-byte on 32 bit processors, that’s still the same as what it was before on 64 bit CPUs, right?).

    l0rinc commented at 10:01 pm on October 26, 2024:

    I’ve replaced all vector keys with 64 bit ints

    Edit: Memory alignment is handled via std::memcpy, optimized out on tested platforms (see godbolt.org/z/dcxvh6abq):

    • Clang (x86-64) - 32 bytes/iter using SSE vector operations
    • GCC (x86-64) - 16 bytes/iter using unrolled 64-bit XORs
    • RISC-V (32-bit) - 8 bytes/iter using load/XOR/store sequence
    • s390x (big-endian) - 64 bytes/iter with unrolled 8-byte XORs

    (please validate, my assembly knowledge is mostly academic)

  12. in src/streams.h:51 in 6ae466bf11 outdated
    59-        // way instead of doing a %, which would effectively be a division
    60-        // for each byte Xor'd -- much slower than need be.
    61-        if (j == key.size())
    62-            j = 0;
    63+    for (size_t j = 0; i < write.size(); ++i, ++j) {
    64+        write[i] ^= key[j % key.size()];
    


    laanwj commented at 10:42 am on October 24, 2024:
    Please avoid % (especially with a dynamic value) in an inner loop. It’s essentially a division operation and those are not cheap.

    l0rinc commented at 10:45 am on October 24, 2024:
    Thanks for the hint, I deliberately removed that (please check the commit messages for details), since these are optimized away. Also, this is just the leftover part, so for key of length 8 (the version used in most places) this will have 7 iterations at most. Can you see any difference with any of the benchmarks?

    laanwj commented at 11:02 am on October 24, 2024:

    It’s only up to 7 iterations (assuming the key size is 8), sure, you’re right.

    But ok, yea, I’m a bit divided about relying on specific non-trivial things being optimized out; it makes the output very dependent on specific compiler decisions (which may be fickle in some cases).


    l0rinc commented at 11:05 am on October 24, 2024:
    Often the simplest code gets optimized most, since it’s more predictable. Would you like me to extend the test or benchmark suite or try something else to make sure we’re comfortable with the change?

    laanwj commented at 11:10 am on October 24, 2024:
    No, it’s fine. It just gives me goosebumps seeing code like this, but if it doesn’t affect the benchmark and no one else cares then it doesn’t matter.

    l0rinc commented at 12:09 pm on October 24, 2024:
    To get rid of the goosebumps I’m handling the remaining 4 bytes as a single 32 bit xor now, so the final loop (when keys are 8 bytes long, which is mostly the case for us, I think) does 3 iterations at most. So even if it’s not optimized away, we should be fine doing 3 divisions by a nice round number like 8.

    maflcko commented at 12:17 pm on October 24, 2024:

    divisions by a nice round number like 8.

    I don’t think the compiler knows the number here, so it can’t use it to optimize the code based on it.


    l0rinc commented at 12:19 pm on October 24, 2024:
    Is that important for max 3 divisions?

    maflcko commented at 7:33 pm on October 24, 2024:

    Yes. At least for me locally, but I am getting widely different bench results anyway: #31144 (comment)

    With this one reverted, XorSmall is back to being just slightly slower than current master for me.


    l0rinc commented at 8:37 pm on October 24, 2024:

    Usually these optimizations concentrate on the measurable parts based on the profiling results that I’m getting during reindexing or IBD. Obfuscating a single bit (i.e. XorSmall) wasn’t my focus, it’s already very fast, didn’t seem like the bottleneck. Would you like me to concentrate on that scenario instead? Or would it make more sense to serialize a block and use that as the basis for the benchmarks?

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    ns/byte byte/s err% total benchmark
    1.99 501,740,412.05 0.5% 10.27 AutoFileXor
    1.24 807,597,761.92 0.0% 11.01 Xor
    1.29 776,238,564.27 0.0% 10.59 XorSmall

    After:

    ns/byte byte/s err% total benchmark
    0.73 1,364,622,484.81 0.9% 8.78 AutoFileXor
    0.02 40,999,383,920.82 0.0% 11.00 Xor
    0.54 1,862,525,472.57 0.0% 11.00 XorSmall

    C++ compiler …………………….. GNU 12.2.0

    Before:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    1.60 624,562,742.35 0.2% 9.20 3.57 2.579 1.03 0.1% 10.61 AutoFileXor
    0.91 1,102,205,071.31 0.0% 9.02 3.26 2.763 1.00 0.1% 11.00 Xor
    1.23 811,876,565.33 0.1% 14.60 4.43 3.295 1.80 0.0% 10.56 XorSmall

    After:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.74 1,346,798,809.87 0.1% 0.72 0.47 1.531 0.16 0.2% 11.02 AutoFileXor
    0.09 11,450,472,586.50 0.0% 0.59 0.31 1.882 0.14 1.9% 10.82 Xor
    5.65 177,092,223.60 0.1% 22.00 20.31 1.083 4.80 0.0% 10.99 XorSmall

    maflcko commented at 9:11 pm on October 24, 2024:

    Would you like me to concentrate on that scenario instead? Or would it make more sense to serialize a block and use that as the basis for the benchmarks?

    Well no. I think this has been mentioned previously. Generally, optimizing for micro benchmarks may not yield results that are actually meaningful or visible for end-users, because the benchmarks capture only a very specific and narrow view. Optimizing for one could even make the code slower for another (as observed above). Adding a bench for the block couldn’t hurt, but I haven’t checked how representative it is. If such a bench represents the IBD behavior, it would be ideal. (There already is a block in the hex bench data, which could be used)

    Usually these optimizations concentrate on the measurable parts based on the profiling results that I’m getting during reindexing or IBD

    Yes, that is more useful. It would be good to share the number you got. Because the commit message simply claims that no benefit was found (“The if (j == key.size()) optimization wasn’t kept since the benchmarks couldn’t show any advantage anymore”).

    XorSmall

    Looks like you can reproduce the slowdown. I wonder if it is correlated with the use of libc++ vs libstdc++


    l0rinc commented at 11:35 am on October 25, 2024:

    It would be good to share the number you got

    The reindex-chainstate until block 600k (2 runs) has just finished, comparing master against the 64/32 bit packing (current state) on Linux with GCC (showing the above inconsistency).

    hyperfine \
    --runs 2 \
    --show-output \
    --export-json /mnt/my_storage/reindex-xor.json \
    --parameter-list COMMIT dea7e2faf1bc48f96741ef84e25e6f47cefd5a92,353915bae14b9704a209bc09b021d3dd2ee11cf2 \
    --prepare 'cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
    'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0'
    

    Before:

    • 3.554 hours
    • 3.567 hours

    After:

    • 3.523 hours
    • 3.527 hours
    Benchmark 1: COMMIT=dea7e2faf1bc48f96741ef84e25e6f47cefd5a92 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0
      Time (mean ± σ):     12819.367 s ± 35.155 s    [User: 11992.168 s, System: 2509.200 s]
      Range (min … max):   12794.508 s … 12844.225 s    2 runs

    Benchmark 2: COMMIT=353915bae14b9704a209bc09b021d3dd2ee11cf2 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0
      Time (mean ± σ):     12685.350 s ± 19.878 s    [User: 11918.349 s, System: 2523.819 s]
      Range (min … max):   12671.295 s … 12699.406 s    2 runs
    

    Reindexing is a lot more stable than IBD (as seen from multiple measurements), showing a consistent 1% speedup. Not earth-shattering, but at least this way the obfuscation isn’t causing a speed regression anymore.


    Adding a bench for the block couldn’t hurt, but I haven’t checked how representative it is

    Let’s find out!


    l0rinc commented at 10:03 pm on October 26, 2024:
    I’ve changed the usages to avoid std::vector keys; this way GCC and Clang both agree that the new results are faster (even though Clang manages to compile to 32-byte SIMD, while GCC only manages 16 bytes per iteration, see #31144 (review))
  13. DrahtBot removed the label CI failed on Oct 24, 2024
  14. l0rinc force-pushed on Oct 24, 2024
  15. l0rinc force-pushed on Oct 24, 2024
  16. l0rinc renamed this:
    optimization: pack util::Xor into 64 bit chunks instead of doing it byte-by-byte
    optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte
    on Oct 24, 2024
  17. l0rinc marked this as ready for review on Oct 24, 2024
  18. laanwj added the label Block storage on Oct 24, 2024
  19. maflcko commented at 7:14 pm on October 24, 2024: member

    It would be good to explain the jpegs in the description, or even remove them. They will be excluded from the merge commit and aren’t shown, unless GitHub happens to be reachable and online. Are they saying that IBD was 4% faster? Also, I think they were created with the UB version of this pull, so may be outdated either way?

    I did a quick check on my laptop and it seems the XorSmall (1+4 bytes) is slower with this pull. The Xor (modified to check 40 bytes) was twice as fast. Overall, I’d expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

    I can try to bench on another machine later, to see if it makes a difference.

    Can you clarify what type of machine you tested this on?

  20. l0rinc commented at 8:39 pm on October 24, 2024: contributor

    Are they saying that IBD was 4% faster?

    That’s what I’m measuring currently, but I don’t expect more than 2% difference here.

    Also, I think they were created with the UB version of this pull, so may be outdated either way?

    Benchmarks indicated that the 64 bit compiled result was basically the same.

    Overall, I’d expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

    I’ll investigate, thanks.

    Posting the perf here for reference: reindexing until 300k blocks reveals that XOR usage was reduced.

  21. in src/test/streams_tests.cpp:379 in a3dc138798 outdated
    296@@ -270,7 +297,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file)
    297         BOOST_CHECK(false);
    298     } catch (const std::exception& e) {
    299         BOOST_CHECK(strstr(e.what(),
    300-                        "Rewind limit must be less than buffer size") != nullptr);
    301+                           "Rewind limit must be less than buffer size") != nullptr);
    


    hodlinator commented at 12:41 pm on October 25, 2024:
    Seems like the prior author just rounded off to some not-too-unreasonable tab-indentation (efd2474d17098c754367b844ec646ebececc7c74). The function isn’t touched in this PR, so we should probably resist touching it here and below.

    l0rinc commented at 2:18 pm on October 25, 2024:
    Seems it was a deliberate formatting, so I’ll revert. Will push after I have the block serialization benchmark working.

    l0rinc commented at 9:58 pm on October 26, 2024:
    Ended up modifying a lot more in the latest push, so this line was formatted again
  22. in src/test/streams_tests.cpp:35 in a3dc138798 outdated
    30+        const size_t write_size = rng.randrange(100);
    31+        const size_t key_offset = rng.randrange(key_size + 2);
    32+
    33+        std::vector key(rng.randbytes<std::byte>(key_size));
    34+        std::vector expected(rng.randbytes<std::byte>(write_size));
    35+        std::vector actual(expected);
    


    hodlinator commented at 12:46 pm on October 25, 2024:

    Experimented with changing to brace-initialization, which uncovered some slight narrowing/widening. (Thought I had an angle for making the code more robust in a more material way, but that attempt failed.)

    auto expected_xor{[](Span<std::byte> write, Span<const std::byte> key, size_t key_offset) {
        if (key.size()) {
            for (auto& b : write) {
                b ^= key[key_offset++ % key.size()];
            }
        }
    }};

    FastRandomContext rng{false};
    for (int test = 0; test < 100; ++test) {
        const int key_size{rng.randrange(10)};
        const int write_size{rng.randrange(100)};
        const int key_offset{rng.randrange(key_size + 2)};

        std::vector key{rng.randbytes<std::byte>(key_size)};
        std::vector expected{rng.randbytes<std::byte>(write_size)};
        std::vector actual{expected};
    

    l0rinc commented at 2:18 pm on October 25, 2024:
    Seems a bit excessive for a test, but ended up changing it to e.g. const size_t write_size{rng.randrange(100U)};. Will push a bit later.
  23. in src/test/streams_tests.cpp:19 in a3dc138798 outdated
    13@@ -14,6 +14,33 @@ using namespace std::string_literals;
    14 
    15 BOOST_FIXTURE_TEST_SUITE(streams_tests, BasicTestingSetup)
    16 
    17+BOOST_AUTO_TEST_CASE(xor_bytes)
    18+{
    19+    auto expected_xor = [](Span<std::byte> write, Span<const std::byte> key, size_t key_offset) {
    


    hodlinator commented at 1:12 pm on October 25, 2024:
    Might as well use std::span for new code.

    l0rinc commented at 2:18 pm on October 25, 2024:
    Indeed!
  24. hodlinator commented at 2:02 pm on October 25, 2024: contributor

    Concept ACK a3dc138798e2f2c7aa1c9b37633c16c1b523a251

    Operating on CPU words rather than individual bytes. :+1:

    Not entirely clear to me from #31144 (review) whether the optimizer is able to use SIMD. Guess picking through the binary of a GUIX-build would give a definitive answer.

    The verbosity of std::memcpy hurts readability but alignment issues are real.

  25. l0rinc marked this as a draft on Oct 25, 2024
  26. l0rinc force-pushed on Oct 26, 2024
  27. l0rinc force-pushed on Oct 26, 2024
  28. l0rinc force-pushed on Oct 26, 2024
  29. DrahtBot commented at 9:58 pm on October 26, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32109998785

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  30. DrahtBot added the label CI failed on Oct 26, 2024
  31. l0rinc force-pushed on Oct 26, 2024
  32. l0rinc renamed this:
    optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte
    optimization: change XOR obfuscation key from `std::vector<std::byte>(8)` to `uint64_t`
    on Oct 26, 2024
  33. l0rinc force-pushed on Oct 27, 2024
  34. DrahtBot removed the label CI failed on Oct 27, 2024
  35. l0rinc force-pushed on Oct 27, 2024
  36. l0rinc force-pushed on Oct 27, 2024
  37. l0rinc marked this as ready for review on Oct 27, 2024
  38. l0rinc force-pushed on Oct 29, 2024
  39. l0rinc force-pushed on Oct 29, 2024
  40. DrahtBot added the label CI failed on Oct 29, 2024
  41. DrahtBot commented at 12:49 pm on October 29, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32217592364

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  42. DrahtBot removed the label CI failed on Oct 29, 2024
  43. hodlinator commented at 12:40 pm on October 31, 2024: contributor

    The CI failure in https://github.com/bitcoin/bitcoin/runs/32217592364 might come from a bad rebase reconciliation with master?

    [12:44:30.723] Duplicate include(s) in src/bench/xor.cpp:
    [12:44:30.723] #include <cstddef>
    [12:44:30.723] #include <span.h>
    [12:44:30.723] #include <streams.h>
    [12:44:30.723] #include <random.h>
    [12:44:30.723] #include <vector>
    [12:44:30.723] #include <bench/bench.h>
    [12:44:30.723] 
    [12:44:30.728] ^---- ⚠️ Failure generated from lint-includes.py
    
  44. l0rinc commented at 12:59 pm on October 31, 2024: contributor

    The CI failure in bitcoin/bitcoin/runs/32217592364 might come from a bad rebase reconciliation with master?

    That’s an old build, right? Latest Lint seems fine

  45. in src/dbwrapper.cpp:259 in 8696184c2f outdated
    263-        // Initialize non-degenerate obfuscation if it won't upset
    264-        // existing, non-obfuscated data.
    265-        std::vector<unsigned char> new_key = CreateObfuscateKey();
    266+    obfuscate_key = 0; // needed for unobfuscated Read
    267+    std::vector<unsigned char> obfuscate_key_vector(OBFUSCATE_KEY_NUM_BYTES, '\000');
    268+    if (bool key_exists = Read(OBFUSCATE_KEY_KEY, obfuscate_key_vector); !key_exists && params.obfuscate && IsEmpty()) {
    


    hodlinator commented at 1:09 pm on October 31, 2024:

    While key_exists is only created for this `if`-block, there are other conditions involved, and we test for the negation of the value, so I find it less surprising to revert to the previous approach.

        bool key_exists = Read(OBFUSCATE_KEY_KEY, obfuscate_key_vector);
        if (!key_exists && params.obfuscate && IsEmpty()) {
    

    l0rinc commented at 2:57 pm on October 31, 2024:
    This pattern has been used before, I meant to narrow the scope of the var here. If you feel strongly about it, I’ll separate them.

    hodlinator commented at 3:09 pm on October 31, 2024:

    This pattern has been used before

    In the example you give, the variable is used in a more “positive” sense. Here we test the negated value, which is part of what makes it jarring.

    One could maybe improve readability by moving the negation and renaming:

        if (bool missing_key = !Read(OBFUSCATE_KEY_KEY, obfuscate_key_vector); missing_key && params.obfuscate && IsEmpty()) {
    

    I meant to narrow the scope of the var here.

    That’s why I was referring to the if-block.

    If you feel strongly about it, I’ll separate them.

    Yes please, either that or my negation move+rename suggestion.


    l0rinc commented at 3:13 pm on October 31, 2024:
    Done both, thanks, good observation about the negation

    hodlinator commented at 3:48 pm on October 31, 2024:

    My point about moving the negation and changing the name made more sense in the context of keeping it inside the if-block. If you are open to moving it out, I’d say it’s better to keep the original key_exists name and original negation to avoid the churn and make it easier to review.

    (Realized another reason for not having it inside the if-block is that we are mutating obfuscate_key_vector, which is used after the block).


    l0rinc commented at 5:43 pm on October 31, 2024:
    I prefer the new key_missing part, it did confuse me once during development as well
  46. in src/node/blockstorage.cpp:1180 in 8696184c2f outdated
    1173@@ -1174,7 +1174,11 @@ static auto InitBlocksdirXorKey(const BlockManager::Options& opts)
    1174         };
    1175     }
    1176     LogInfo("Using obfuscation key for blocksdir *.dat files (%s): '%s'\n", fs::PathToString(opts.blocks_dir), HexStr(xor_key));
    1177-    return std::vector<std::byte>{xor_key.begin(), xor_key.end()};
    1178+    assert(xor_key.size() == 8);
    1179+    uint64_t key;
    1180+    std::memcpy(&key, xor_key.data(), 8);
    1181+    xor_key.fill(std::byte{0});
    


    hodlinator commented at 1:16 pm on October 31, 2024:

    I find it more useful to (static) assert that the std::array size matches the uint64 size directly. Also don’t see a point in zeroing out the local variable before returning?

        uint64_t key;
        static_assert(xor_key.size() == sizeof(key));
        std::memcpy(&key, xor_key.data(), sizeof(key));
    

    l0rinc commented at 2:59 pm on October 31, 2024:
    I’ve added the fills to make sure we’re not using them after conversion anymore. What would be the advantage of the static asserts? I don’t mind removing these failsafes if you think they’re redundant or noisy.

    l0rinc commented at 3:15 pm on October 31, 2024:
    Removed in the end, not that important
  47. in src/bench/xor.cpp:7 in 8696184c2f outdated
    2@@ -3,22 +3,1283 @@
    3 // file COPYING or https://opensource.org/license/mit/.
    4 
    5 #include <bench/bench.h>
    6+#include <cmath>
    7+#include <cstddef>
    8+#include <map>
    


    fanquake commented at 1:16 pm on October 31, 2024:
    You can keep the std lib headers separated from our own headers.

    l0rinc commented at 1:55 pm on October 31, 2024:
    Done, anything else?
  48. maflcko commented at 1:18 pm on October 31, 2024: member

    Taking a step back, I wonder if this is worth it. IIRC it gives a +1% speedup when run on spinning storage, so a higher speedup is possibly visible on faster, modern storage. However, it would be good if any IO delay was taken out of IBD completely, so that the speed of storage and the speed of XOR are largely irrelevant.

    I haven’t looked, but this may be possible by asking for the next block to be read into memory in the background, as soon as work on the current block begins.

    Something similar is done in net_processing, where the (presumed) current block is read into memory, and kept there, to avoid having to read it from storage again in validation. See https://github.com/bitcoin/bitcoin/blob/f07a533dfcb172321972e5afb3b38a4bd24edb87/src/net_processing.cpp#L823

  49. in src/node/blockstorage.cpp:1177 in 23fc898514 outdated
    1173@@ -1174,6 +1174,7 @@ static auto InitBlocksdirXorKey(const BlockManager::Options& opts)
    1174         };
    1175     }
    1176     LogInfo("Using obfuscation key for blocksdir *.dat files (%s): '%s'\n", fs::PathToString(opts.blocks_dir), HexStr(xor_key));
    1177+    assert(xor_key.size() == 8);
    


    hodlinator commented at 1:20 pm on October 31, 2024:
    Don’t see much point in adding the assert here in 23fc898514bf9696facbaff65251b62c362d214e, where we still only have a fixed-size std::array with the asserted fixed size of 8. Seems sufficient with the assert in the BlockManager-ctor.

    l0rinc commented at 3:13 pm on October 31, 2024:
    Simplified, thanks
  50. l0rinc commented at 1:24 pm on October 31, 2024: contributor

    Taking a step back, I wonder if this is worth it. IIRC it gives a +1% speedup

    That’s not the main point; rather, we’re storing the key in a value that can be short-circuited easily, so that when the key is 0 (i.e. xor is a NOOP) we can skip it. Previously this would only have been possible by checking each byte of the key. It’s also a lot cleaner to store it in a primitive, which supports xor natively. Xor comes up in every profiling run I do; we shouldn’t have a regression because of #28207 - this PR solves that.
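
    To illustrate the short-circuit point, a sketch of the two checks (IsNoopKey is an illustrative name, not patch code):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // With a std::vector<std::byte> key, detecting the disabled case means
    // scanning every key byte:
    bool IsNoopKey(const std::vector<std::byte>& key)
    {
        return std::all_of(key.begin(), key.end(),
                           [](std::byte b) { return b == std::byte{0}; });
    }

    // With a uint64_t key, the same check is a single integer comparison:
    bool IsNoopKey(uint64_t key) { return key == 0; }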

  51. in src/test/streams_tests.cpp:262 in 850214ffd9 outdated
    261@@ -235,7 +262,7 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
    262     // Single character key
    


    hodlinator commented at 1:29 pm on October 31, 2024:
    In 850214ffd9f56e887a18d0428d5881e6c1ee8652: Single/Multi character key comments don’t make sense inside of this commit.

    l0rinc commented at 3:13 pm on October 31, 2024:
    Removed, thanks

    hodlinator commented at 3:39 pm on October 31, 2024:
    (Should be done in the initial commit which invalidates the comments IMO).
  52. maflcko commented at 1:39 pm on October 31, 2024: member

    we shouldn’t have a regression because of #28207 - this PR solves that.

    It is not possible to do XOR without any cost at all. There will always be an overhead and I think calling #28207 a regression and this change a “fix” is not entirely accurate. This change reduces the overhead, according to the benchmarks.

    That’s not the main point

    The pull request title starts with “optimization”, so I got the impression that a speedup is the main point.

    The reason that std::vector was picked is that switching to larger sizes is possible. However, if there was a need to do that, XOR would likely not be sufficient anyway. So limiting to 8 bytes fixed at compile time seems reasonable.

    I am mostly saying that any speedup here may not be visible at all if IO is completely taken out of the critical path, but I haven’t looked into that in detail.

  53. l0rinc commented at 1:52 pm on October 31, 2024: contributor

    change a “fix” is not entirely accurate

    When xor is disabled we’re now not xor-ing at all. Previously the xor was applied even with an all-zero key, so this change brings back the pre-obfuscation behavior when xor is not needed.

    The pull request title starts with “optimization”, so I got the impression that a speedup is the main point.

    Yes, please see the updated description with the benchmarks: #31144#issue-2610689777

    may not be visible at all if IO is completely taken out of the critical path

    I’d argue the current implementation is slightly simpler (i.e. xor is stored and performed natively and can be disabled) and faster (2x for a representative dataset).

  54. l0rinc force-pushed on Oct 31, 2024
  55. in src/streams.cpp:14 in 4437d4379c outdated
    10@@ -11,6 +11,7 @@
    11 AutoFile::AutoFile(std::FILE* file, std::vector<std::byte> data_xor)
    12     : m_file{file}, m_xor{std::move(data_xor)}
    13 {
    14+    assert(m_xor.size() == 8);
    


    hodlinator commented at 2:15 pm on October 31, 2024:
    Guess the point of adding this assert is to prove that it doesn’t break anything before switching to uint64?

    l0rinc commented at 3:01 pm on October 31, 2024:
    yes
  56. in src/streams.h:408 in 4437d4379c outdated
    404@@ -393,7 +405,7 @@ class AutoFile
    405     std::optional<int64_t> m_position;
    406 
    407 public:
    408-    explicit AutoFile(std::FILE* file, std::vector<std::byte> data_xor={});
    409+    explicit AutoFile(std::FILE* file, std::vector<std::byte> data_xor = {std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}});
    


    hodlinator commented at 2:27 pm on October 31, 2024:
    In 4437d4379c42a7b87bd01ad5ea6c450a732f4f95: Less verbose: {std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}} => {8, std::byte{0x00}}

    l0rinc commented at 3:01 pm on October 31, 2024:
    it’s removed in the next commit

    l0rinc commented at 3:13 pm on October 31, 2024:
    But done anyway
  57. in src/node/mempool_persist.cpp:189 in 33625e9d53 outdated
    188-            file.SetXor(xor_key);
    189+            std::vector<std::byte> xor_key_vector(8);
    190+            FastRandomContext{}.fillrand(xor_key_vector);
    191+            file << xor_key_vector;
    192+
    193+            uint64_t m_xor;
    


    hodlinator commented at 2:46 pm on October 31, 2024:
    Should not use m_-prefix for non-member variables.

    l0rinc commented at 3:13 pm on October 31, 2024:
    You’re right, I’ve used xor_key in the same case before
  58. l0rinc force-pushed on Oct 31, 2024
  59. hodlinator commented at 4:00 pm on October 31, 2024: contributor

    (Reviewed a5cad729c76cafa047a2b1897595669ae9b2b0d5)

    Since my previous Concept ACK, the PR was changed to switch the xor key more completely to uint64_t. Before the PR, we were already using a fixed size of 8 bytes for the obfuscation value in the file formats, so changing the type to uint64_t shouldn’t be noticeable to users. :+1:

    Even if we could move reading and XOR-ing out of the hot path as suggested by maflcko, we might as well make use of the CPU architectures we have. I would expect larger-sized XOR operations to have less overhead and energy waste (less heat).

  60. maflcko commented at 5:32 pm on October 31, 2024: member

    We could have used ReadLE64 to unify the byte order of keys and writable values, but that shouldn’t be necessary, since both have the same endianness locally, which isn’t affected by a byte-by-byte xor.

    The s390x unit tests fail:

    ./src/test/streams_tests.cpp(40): error: in "streams_tests/xor_bytes": check { expected.begin(), expected.end() } == { actual.begin(), actual.end() } has failed. 
    Mismatch at position 0: � != |
    Mismatch at position 1: Y != �
    Mismatch at position 2: � != �
    Mismatch at position 3: � != �
    Mismatch at position 4: w != �
    Mismatch at position 5: C != �
    Mismatch at position 6:  != x
    Mismatch at position 7: � != �
    Mismatch at position 8: � != C
    Mismatch at position 9: Y != �
    Mismatch at position 10: � != �
    Mismatch at position 11: , != �
    Mismatch at position 12: � != R
    Mismatch at position 13: 8 != �
    Mismatch at position 14: � != �
    Mismatch at position 15: � != �
    Mismatch at position 16: � != l
    Mismatch at position 17: � != n
    Mismatch at position 18: � != �
    Mismatch at position 19: t != B
    Mismatch at position 20: ; != �
    Mismatch at position 21: != �
    Mismatch at position 22: � != �
    Mismatch at position 23: � != �
    Mismatch at position 24: k != �
    Mismatch at position 25: � != Z
    Mismatch at position 26: � != �
    Mismatch at position 27: � != �
    Mismatch at position 28: � != #
    Mismatch at position 29: 8 != �
    Mismatch at position 30: � != �
    Mismatch at position 31: � != �
    Mismatch at position 32: g != �
    Mismatch at position 33: � != ^
    Mismatch at position 34: � != �
    Mismatch at position 36: � != k
    Mismatch at position 37: * != �
    Mismatch at position 38: q != 
    Mismatch at position 39: � != �
    Mismatch at position 40: � != �
    Mismatch at position 41: r != e
    Mismatch at position 42: � != �
    
  61. l0rinc commented at 5:42 pm on October 31, 2024: contributor

    The s390x unit tests fail:

    I don’t know how to access that, is it part of CI? Does the test suite pass on it otherwise or was it just curiosity? Do you know the reason it fails, is my assumption incorrect that endianness should apply to both parts (key and value) the same way? Is the test wrong or the xor, should I change the test such that xor-ing twice should reveal the original data instead (while the intermediary part should not)?

  62. maflcko commented at 6:37 pm on October 31, 2024: member

    I don’t know how to access that, is it part of CI?

    It needs to be run manually. See https://github.com/bitcoin/bitcoin/tree/master/ci#running-a-stage-locally. (podman run --rm --privileged docker.io/multiarch/qemu-user-static --reset -p yes may be required to setup qemu-s390x, depending on your setup). Then something like MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh should run it.

    Does the test suite pass on it otherwise or was it just curiosity?

    Yes, it should pass on s390x. If not, that is a bug somewhere.

  63. l0rinc marked this as a draft on Oct 31, 2024
  64. l0rinc force-pushed on Nov 2, 2024
  65. l0rinc commented at 7:45 pm on November 2, 2024: contributor

    Thanks for the hints @maflcko, I was under the impression that big-endian tests were run automatically.

    Fix

    It turns out that std::rotr operates on the integer value rather than on memory bytes, so the rotation has to take endianness into account (advancing the key stream by one byte is a right-rotation by 8 bits on little-endian, but the direction flips on big-endian). The fix basically looks like:

    size_t key_rotation = 8 * key_offset;
    if constexpr (std::endian::native == std::endian::big) key_rotation *= -1;
    return std::rotr(key, key_rotation);
    

    I’ve emulated the s390x behavior locally like this:

    brew install podman pigz
    softwareupdate --install-rosetta
    podman machine init
    podman machine start
    docker run --platform linux/s390x -it ubuntu:22.04 /bin/bash
        apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev && \
        cd /mnt && git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
        cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
        ./build/src/test/test_bitcoin --run_test=streams_tests
    

    Changes

    • The change also includes an updated benchmarking suite, with the first 700k blocks inspected for all usages of util::Xor to make the benchmark representative:

    • Changed AutoFileXor benchmark to measure the regression of turning off obfuscation.

    • I’ve also updated all key derivations to vector-to-uint64 instead of generating the 64 bit key directly (it’s more consistent, more representative and helps with endianness).

    • Added a xor roundtrip test which applies the xor in random chunks, asserts that the result differs from the original, then reapplies it in different random chunks and asserts that the result is equivalent to the original (see the sketch below).

    See the complete diff: https://github.com/bitcoin/bitcoin/compare/a5cad729c76cafa047a2b1897595669ae9b2b0d5..af778c5bdea4175f84e82b41f3b51c7f453d8c7e
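
    A sketch of the roundtrip property (assuming a util::Xor overload taking (write, key, key_offset), as exercised by the tests; the helper name and details are illustrative):

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <random.h>
    #include <span>
    #include <streams.h>
    #include <vector>

    // Obfuscating twice with the same key restores the data, regardless of
    // how the writes are chunked; each pass picks fresh chunk boundaries.
    void XorRoundtripSketch()
    {
        FastRandomContext rng{/*fDeterministic=*/true};
        const uint64_t key{rng.rand64()};
        const std::vector<std::byte> original{rng.randbytes<std::byte>(1000)};
        std::vector<std::byte> data{original};

        for (int pass{0}; pass < 2; ++pass) {
            for (size_t pos{0}; pos < data.size();) {
                const size_t len{std::min<size_t>(1 + rng.randrange(100), data.size() - pos)};
                util::Xor(std::span{data}.subspan(pos, len), key, /*key_offset=*/pos);
                pos += len;
            }
            if (pass == 0) assert(data != original); // obfuscated: almost surely differs
        }
        assert(data == original); // the second pass undid the first
    }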

  66. l0rinc marked this as ready for review on Nov 2, 2024
  67. l0rinc force-pushed on Nov 5, 2024
  68. l0rinc commented at 4:40 pm on November 5, 2024: contributor

    Updated the benchmark to 860k blocks (September 2024):

    This one contains a lot of very big arrays (96'233 distinct sizes, the biggest being 3'992'470 bytes long) - a big departure from the previous 400k and 700k block profiles (1500 distinct sizes, the biggest being 9319 bytes long).

    The performance characteristics are also quite different, now that we have more and bigger byte arrays:

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    ns/byte byte/s err% total benchmark
    1.29 774,577,944.12 0.2% 115.99 XorHistogram

    After:

    ns/byte byte/s err% total benchmark
    0.04 26,411,646,837.32 0.2% 8.97 XorHistogram

    i.e. ~35x faster with Clang at processing the data with representative histograms.

    C++ compiler …………………….. GNU 13.2.0

    Before:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.97 1,032,916,679.87 0.0% 9.01 3.29 2.738 1.00 0.0% 86.58 XorHistogram

    After:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.10 10,369,097,976.62 0.0% 0.32 0.33 0.985 0.06 0.6% 8.63 XorHistogram

    i.e. ~10x faster with GCC at processing the data with representative histograms.

    Edit: note that I couldn’t use random byte generation for each benchmark value since it timed out on CI. I have replaced it with getting subsets of a single big random vector.

  69. l0rinc force-pushed on Nov 5, 2024
  70. DrahtBot added the label CI failed on Nov 5, 2024
  71. DrahtBot commented at 4:49 pm on November 5, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32549559124

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  72. l0rinc force-pushed on Nov 6, 2024
  73. l0rinc force-pushed on Nov 6, 2024
  74. l0rinc force-pushed on Nov 7, 2024
  75. l0rinc force-pushed on Nov 8, 2024
  76. DrahtBot removed the label CI failed on Nov 8, 2024
  77. l0rinc commented at 10:59 am on November 11, 2024: contributor

    I ran a reindex-chainstate until 860k blocks (SSD, i7 CPU), before and after this change, 2 runs per commit. As stated in the previous comment, the latter blocks (>700k) seem to contain a lot of very big vectors where the new algorithm shines.

    hyperfine \
    --runs 2 \
    --export-json /mnt/my_storage/reindex-chainstate-obfuscation-overhead.json \
    --parameter-list COMMIT 866f4fa521f6932162570d6531055cc007e3d0cd,3efc72ff7cbdfb788b23bf4346e29ba99362c120,08db1794647c37f966c525411f931a4a0f6b6119 \
    --prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
    'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate'

    Which indicates that this change speeds up IBD (or at least reindex-chainstate) by roughly ~3% (i.e. reverts the regression):

    • 866f4fa521f6932162570d6531055cc007e3d0cd - baseline
    • 3efc72ff7cbdfb788b23bf4346e29ba99362c120 - end-to-end vector to uint64
    • 08db1794647c37f966c525411f931a4a0f6b6119 - dummy commit that forces obfuscation keys to be 0 to ignore xor operations (since -blocksxor=0 doesn’t affect dbwrapper), to see how xor affects speed in general

    Benchmark 1: COMMIT=866f4fa521f6932162570d6531055cc007e3d0cd ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate
      Time (mean ± σ): 20846.396 s ± 72.227 s [User: 22353.773 s, System: 2563.864 s]
      Range (min … max): 20795.323 s … 20897.468 s, 2 runs

    Benchmark 2: COMMIT=3efc72ff7cbdfb788b23bf4346e29ba99362c120 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate
      Time (mean ± σ): 20308.036 s ± 100.345 s [User: 21745.076 s, System: 2591.459 s]
      Range (min … max): 20237.081 s … 20378.991 s, 2 runs

    Benchmark 3: COMMIT=08db1794647c37f966c525411f931a4a0f6b6119 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate
      Time (mean ± σ): 20293.227 s ± 9.923 s [User: 21711.799 s, System: 2588.886 s]
      Range (min … max): 20286.210 s … 20300.244 s, 2 runs

    Summary
    'COMMIT=08db1794647c37f966c525411f931a4a0f6b6119 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate' ran
     1.00 ± 0.00 times faster than 'COMMIT=3efc72ff7cbdfb788b23bf4346e29ba99362c120 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate'
     1.03 ± 0.00 times faster than 'COMMIT=866f4fa521f6932162570d6531055cc007e3d0cd ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate'
    
  78. in src/test/streams_tests.cpp:301 in a123297318 outdated
    299+        uint64_t xor_key;
    300+        std::memcpy(&xor_key, xor_pat.data(), 8);
    301+
    302         DataStream ds{in};
    303-        ds.Xor({0xff});
    304+        ds.Xor(xor_key);
    


    hodlinator commented at 8:26 pm on November 11, 2024:

    Let’s not add bloat for this case.

            ds.Xor(0xffffffffffffffff);
    

    l0rinc commented at 9:59 pm on November 11, 2024:
    Please see: #31144 (review)

    hodlinator commented at 10:26 pm on November 11, 2024:
    All ones (binary) is not endian-sensitive. IMO it’s okay for the tests to look slightly different to reduce this kind of noise.

    l0rinc commented at 10:28 pm on November 11, 2024:
    I want to test the setup we have in production; these tests were very useful in revealing endianness problems.
  79. in src/test/streams_tests.cpp:315 in a123297318 outdated
    316+        uint64_t xor_key;
    317+        std::memcpy(&xor_key, xor_pat.data(), 8);
    318+
    319         DataStream ds{in};
    320-        ds.Xor({0xff, 0x0f});
    321+        ds.Xor(xor_key);
    


    hodlinator commented at 8:32 pm on November 11, 2024:

    Should work regardless of endianness.

            ds.Xor(0xff0fff0fff0fff0f);
    

    l0rinc commented at 9:56 pm on November 11, 2024:
    It doesn’t (at least not for all test cases); I’m deliberately generating vectors and converting them, instead of generating 64-bit values directly, since that’s what happens in production code (this should address multiple of your comments)

    hodlinator commented at 10:45 pm on November 11, 2024:
  80. in src/test/streams_tests.cpp:82 in a123297318 outdated
    78     const std::vector<uint8_t> test1{1, 2, 3};
    79     const std::vector<uint8_t> test2{4, 5};
    80-    const std::vector<std::byte> xor_pat{std::byte{0xff}, std::byte{0x00}};
    81+    const std::vector xor_pat{std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}};
    82+    uint64_t xor_key;
    83+    std::memcpy(&xor_key, xor_pat.data(), 8);
    


    hodlinator commented at 8:35 pm on November 11, 2024:
        const uint64_t xor_pat{0xff00ff00ff00ff00};
    

    l0rinc commented at 9:56 pm on November 11, 2024:
    Please see: #31144 (review)

    hodlinator commented at 10:35 pm on November 11, 2024:
    Feel like I’m imitating ChatGPT: Ah, yes, I get now how the memcpy() interaction with endianness preserves the functioning of this specific test.
  81. in src/dbwrapper.cpp:257 in a123297318 outdated
    261-
    262-    if (!key_exists && params.obfuscate && IsEmpty()) {
    263-        // Initialize non-degenerate obfuscation if it won't upset
    264-        // existing, non-obfuscated data.
    265-        std::vector<unsigned char> new_key = CreateObfuscateKey();
    266+    obfuscate_key = 0; // needed for unobfuscated Read
    


    hodlinator commented at 9:19 pm on November 11, 2024:
        obfuscate_key = 0; // Needed for unobfuscated Read
    

    l0rinc commented at 11:07 pm on November 11, 2024:
    Done
  82. in src/node/mempool_persist.cpp:64 in a123297318 outdated
    62         if (version == MEMPOOL_DUMP_VERSION_NO_XOR_KEY) {
    63-            // Leave XOR-key empty
    64+            file.SetXor(0);
    65         } else if (version == MEMPOOL_DUMP_VERSION) {
    66-            file >> xor_key;
    67+            std::vector<std::byte> xor_key_vector(8);
    


    hodlinator commented at 9:27 pm on November 11, 2024:

    Here and on line 185:

                std::vector<std::byte> xor_key_vector{8};
    

    l0rinc commented at 11:07 pm on November 11, 2024:
    Now I had to rename the PR :D
  83. in src/streams.h:50 in a123297318 outdated
    59+    switch (write.size()) { // Specify the optimization categories
    60+    case 0: break;
    61+    case 1: XorInt(write, key, 1); break;
    62+    case 2: XorInt(write, key, 2); break;
    63+    case 4: XorInt(write, key, 4); break;
    64+    default: XorInt(write, key, write.size());
    


    hodlinator commented at 9:36 pm on November 11, 2024:
    Care to elaborate why we don’t just have the default statement?

    l0rinc commented at 9:59 pm on November 11, 2024:
    “Specify the optimization categories” means that the compiler will be able to optimize the cases where one of the parameters (the size) is a constant separately from each other. The default statement would work, but would be very slow, since the 1, 2 and 4 byte versions won’t be specialized.

    hodlinator commented at 10:42 pm on November 11, 2024:

    How about

        // Help optimizers along by sending constant parameter values into the inlined function,
        // resulting in more efficient substitutions of memcpy() -> native pow-2 copy instructions.
        switch (write.size()) {
        case 0: break;
        case 1: XorInt(write, key, 1); break;
        case 2: XorInt(write, key, 2); break;
        case 4: XorInt(write, key, 4); break;
        default: XorInt(write, key, write.size());
    

    l0rinc commented at 11:07 pm on November 11, 2024:
    I ended up with only // Help the compiler specialize 1, 2 and 4 byte cases, since the rest was just speculation
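
    For readers following along, a minimal sketch of what the specialization buys, assuming the PR’s XorInt shape (std::span stands in for Span here for self-containment): when the size argument is a compile-time constant, compilers typically collapse each memcpy into a single fixed-width load or store.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <span>

    inline void XorInt(std::span<std::byte> write, const uint64_t key, const size_t size)
    {
        // precondition: size <= sizeof(key), i.e. this handles the tail after the 64-bit loop
        uint64_t raw{0};
        std::memcpy(&raw, write.data(), size); // constant size -> a single load on tested targets
        raw ^= key;
        std::memcpy(write.data(), &raw, size); // constant size -> a single store
    }

    // XorInt(write, key, 4) can compile down to a 32-bit load/xor/store,
    // while the default case with a runtime size keeps a generic memcpy.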
  84. hodlinator commented at 9:49 pm on November 11, 2024: contributor

    Reviewed a1232973189126cfc9526713011461709685fcc8

    git range-diff master a5cad72 a123297

    The awk script in the comment in c671dd17dced0d51845dc8d2148f288c4c44ecb2 doesn’t add the ' separators…?

    Care to explain the scaling_factor value?

    XorHistogram claims to use 8 GB of RAM. Could be a bit much if we want to be able to also run benchmarks on low-end devices.

     0diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
     1index 2ba8b17f08..bb193c7fcf 100644
     2--- a/src/bench/xor.cpp
     3+++ b/src/bench/xor.cpp
     4@@ -96269,18 +96269,13 @@ static void XorHistogram(benchmark::Bench& bench)
     5 
     6         total_bytes += scaled_count * size;
     7         for (size_t j{0}; j < scaled_count; ++j) {
     8-            std::vector<std::byte> ret;
     9-            ret.reserve(size);
    10-            ret.insert(ret.end(), pattern.begin(), pattern.begin() + size);
    11-            test_data.push_back(std::move(ret));
    12+            test_data.emplace_back(pattern.begin(), pattern.begin() + size);
    13         }
    14     }
    15     assert(total_bytes == 8'129'394'848); // ~8 GB of data
    16     std::shuffle(test_data.begin(), test_data.end(), rng); // Make it more realistic & less predictable
    17 
    18-    std::vector key_bytes{rng.randbytes<std::byte>(8)};
    19-    uint64_t key;
    20-    std::memcpy(&key, key_bytes.data(), 8);
    21+    const uint64_t key{rng.rand64()};
    22 
    23     size_t offset{0};
    24     bench.batch(total_bytes).unit("byte").run([&] {
    25@@ -96296,9 +96291,7 @@ static void AutoFileXor(benchmark::Bench& bench)
    26     FastRandomContext rng{/*fDeterministic=*/true};
    27     auto data{rng.randbytes<std::byte>(1'000'000)};
    28 
    29-    std::vector<std::byte> empty_key_bytes(8, std::byte{0}); // Test disabled xor
    30-    uint64_t empty_key;
    31-    std::memcpy(&empty_key, empty_key_bytes.data(), 8);
    32+    uint64_t empty_key{0};
    33 
    34     const fs::path test_path = fs::temp_directory_path() / "xor_benchmark.dat";
    35     AutoFile f{fsbridge::fopen(test_path, "wb+"), empty_key};
    
  85. l0rinc commented at 10:11 pm on November 11, 2024: contributor

    The awk script in the comment in https://github.com/bitcoin/bitcoin/commit/c671dd17dced0d51845dc8d2148f288c4c44ecb2 doesn’t add the ’ separators…?

    I’ve added them manually.

    Care to explain the scaling_factor value?

The values in the histogram (i.e. total bytes streamed through XOR) add up to ~92 GB, but 1-byte values occupy half of that. (edit: this is only true for the first row; in other cases we would need to multiply by the first column)

Since we have thousands of big values, each representing a vector of that size, we have to include all of those in the test set at least once. I had to scale down the histogram such that the lower values, having only a few hundred occurrences, aren’t flattened out completely - to keep the histogram representative. Do you have a better idea?
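
    For illustration, with the constants from the code quoted later in this thread (scaling_factor = 1'000'000 and max_count = 47'584'838'861, the count for 1-byte writes), the scaling is scaled_count = ceil(count / max_count * scaling_factor):

    // size 1 (the most common): 47'584'838'861 occurrences -> 1'000'000 passes
    // a rare size with count 300: ceil(300 / 47'584'838'861 * 1'000'000) = ceil(0.0063) = 1 pass
    // i.e. the ceil is what keeps rare-but-big sizes from being flattened out entirely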

    XorHistogram claims to use 8 GB of RAM. Could be a bit much if we want to be able to also run benchmarks on low-end devices.

    Yes, but if I scale it down more, more values will be equal in the histogram and it won’t reflect real usage. That’s why I’ve set it to low priority, we don’t have to run these for every execution.

Edit: pushed some nits, see git range-diff 866f4fa521f6932162570d6531055cc007e3d0cd..a1232973189126cfc9526713011461709685fcc8 866f4fa521f6932162570d6531055cc007e3d0cd..57caa965b5ae284e501f892415d60fcb536f4c0e

  86. l0rinc force-pushed on Nov 11, 2024
  87. l0rinc renamed this:
    optimization: change XOR obfuscation key from `std::vector<std::byte>(8)` to `uint64_t`
    optimization: change XOR obfuscation key from `std::vector<std::byte>{8}` to `uint64_t`
    on Nov 11, 2024
  88. hodlinator commented at 11:43 pm on November 11, 2024: contributor

    Care to explain the scaling_factor value?

    The values in the histogram (i.e. total bytes streamed through xor) add up to ~92gb, but 1 byte values occupy half of that. I had to scale down the histogram such that the lower values, having a few hundred occurrences aren’t flattened out completely - to make the histogram still representative. Do you have a better idea?

    So raw_histogram really is the raw data?

    Added:

    0    uint64_t raw_total_bytes{0};
    1    for (size_t i{0}; i < std::size(raw_histogram); i += 2) {
    2        const uint64_t size = raw_histogram[i];
    3        const uint64_t count = raw_histogram[i + 1];
    4        raw_total_bytes += size * count;
    5    }
6    printf("raw_total_bytes: %lu", raw_total_bytes);
    

    Got:

    0raw_total_bytes: 1'277'637'746'542 # "'" added manually
    

    So something like 1.28TB instead of 92GB, but I can’t seem to get my head screwed on right today.


    If I re-understood correctly what the code was doing with the scaling - it’s doing only 1'000'000 XOR-passes for the most common size (1 byte) instead of 47'584'838'861, and scaling down the number of XOR-passes for the others by the same factor.

    Thought the following would be slightly faster:

     0diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
     1index 2ba8b17f08..5e933d74ea 100644
     2--- a/src/bench/xor.cpp
     3+++ b/src/bench/xor.cpp
     4@@ -96253,7 +96253,7 @@ static void XorHistogram(benchmark::Bench& bench)
     5     };
     6 
     7     const auto max_count{static_cast<double>(raw_histogram[1])}; // 1 byte is the most common
     8-    const auto scaling_factor{1'000'000U};
     9+    const auto scaling_factor{1'000'000.0 / max_count};
    10 
    11     FastRandomContext rng{/*fDeterministic=*/true};
    12     const size_t pattern_size = raw_histogram[std::size(raw_histogram) - 2];
    13@@ -96265,7 +96265,7 @@ static void XorHistogram(benchmark::Bench& bench)
    14     for (size_t i{0}; i < std::size(raw_histogram); i += 2) {
    15         const uint64_t size = raw_histogram[i];
    16         const uint64_t count = raw_histogram[i + 1];
    17-        const size_t scaled_count{static_cast<size_t>(std::ceil((static_cast<double>(count) / max_count) * scaling_factor))};
    18+        const size_t scaled_count{static_cast<size_t>(std::ceil(static_cast<double>(count) * scaling_factor))};
    19 
    20         total_bytes += scaled_count * size;
    21         for (size_t j{0}; j < scaled_count; ++j) {
    

    Still passes the total_bytes == 8'129'394'848-assert unmodified, but seems slightly slower in my measurements (including for posterity).

  89. hodlinator commented at 9:21 am on November 12, 2024: contributor

    Reviewed 57caa965b5ae284e501f892415d60fcb536f4c0e

    git range-diff master a123297 57caa96

    • Applied most of my nits that were still valid.

    I understand you were burned by endianness but I disagree that it’s worth sacrificing readability where endianness is a non-issue.

    For cases where endianness is a concern, I suggest the following pattern:

    0uint64_t xor_key;
    1constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00})};
    2std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);
    
    1. Declare the uint64_t first so sizeof can be used in place of magic 8.
    2. Use constexpr std::array instead of const std::vector to constrain the size.
    3. Use to_array initialization since regular brace-initialization will only complain if too many elements are given, not too few.
    4. Use uint8_t in favor of std::byte to improve readability.
    5. Use sizeof <target> for the memcpy call since it is good practice to avoid writing out of bounds and also avoids the magic 8.

    Use ~0ULL to express a uint64_t with all bits set, ensuring one doesn’t forget an f.
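
    A quick illustration of points 2-3 (snippet is hypothetical): plain brace-initialization silently zero-fills missing elements, while to_array pins the element count at compile time.

    #include <array>
    #include <cstdint>

    std::array<uint8_t, 8> a{0xff, 0x00}; // compiles: the six missing elements become 0x00
    // std::array<uint8_t, 8> b{std::to_array<uint8_t>({0xff, 0x00})}; // error: size mismatch (2 != 8)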


     0diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
     1index f94a1a6f96..fdf41b3696 100644
     2--- a/src/bench/xor.cpp
     3+++ b/src/bench/xor.cpp
     4@@ -96275,10 +96275,7 @@ static void XorHistogram(benchmark::Bench& bench)
     5     assert(total_bytes == 8'129'394'848); // ~8 GB of data
     6     std::shuffle(test_data.begin(), test_data.end(), rng); // Make it more realistic & less predictable
     7 
     8-    std::vector key_bytes{rng.randbytes<std::byte>(8)};
     9-    uint64_t key;
    10-    std::memcpy(&key, key_bytes.data(), 8);
    11-
    12+    const uint64_t key{rng.rand64()};
    13     size_t offset{0};
    14     bench.batch(total_bytes).unit("byte").run([&] {
    15         for (auto& data : test_data) {
    16@@ -96293,10 +96290,7 @@ static void AutoFileXor(benchmark::Bench& bench)
    17     FastRandomContext rng{/*fDeterministic=*/true};
    18     auto data{rng.randbytes<std::byte>(1'000'000)};
    19 
    20-    std::vector<std::byte> empty_key_bytes(8, std::byte{0}); // Test disabled xor
    21-    uint64_t empty_key;
    22-    std::memcpy(&empty_key, empty_key_bytes.data(), 8);
    23-
    24+    const uint64_t empty_key{0}; // Test disabled xor
    25     const fs::path test_path = fs::temp_directory_path() / "xor_benchmark.dat";
    26     AutoFile f{fsbridge::fopen(test_path, "wb+"), empty_key};
    27     bench.batch(data.size()).unit("byte").run([&] {
    28diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
    29index 5e9b607b3a..ac51eb3e51 100644
    30--- a/src/test/streams_tests.cpp
    31+++ b/src/test/streams_tests.cpp
    32@@ -77,9 +77,9 @@ BOOST_AUTO_TEST_CASE(xor_file)
    33     auto raw_file{[&](const auto& mode) { return fsbridge::fopen(xor_path, mode); }};
    34     const std::vector<uint8_t> test1{1, 2, 3};
    35     const std::vector<uint8_t> test2{4, 5};
    36-    const std::vector xor_pat{std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}};
    37     uint64_t xor_key;
    38-    std::memcpy(&xor_key, xor_pat.data(), 8);
    39+    constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00})};
    40+    std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);
    41 
    42     {
    43         // Check errors for missing file
    44@@ -293,12 +293,8 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
    45     in.push_back(std::byte{0xf0});
    46 
    47     {
    48-        const std::vector xor_pat{std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}};
    49-        uint64_t xor_key;
    50-        std::memcpy(&xor_key, xor_pat.data(), 8);
    51-
    52         DataStream ds{in};
    53-        ds.Xor(xor_key);
    54+        ds.Xor(~0ULL);
    55         BOOST_CHECK_EQUAL("\xf0\x0f"s, ds.str());
    56     }
    57 
    58@@ -307,9 +303,9 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
    59     in.push_back(std::byte{0x0f});
    60 
    61     {
    62-        const std::vector xor_pat{std::byte{0xff}, std::byte{0x0f}, std::byte{0xff}, std::byte{0x0f}, std::byte{0xff}, std::byte{0x0f}, std::byte{0xff}, std::byte{0x0f}};
    63         uint64_t xor_key;
    64-        std::memcpy(&xor_key, xor_pat.data(), 8);
    65+        constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x0f, 0xff, 0x0f, 0xff, 0x0f, 0xff, 0x0f})};
    66+        std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);
    67 
    68         DataStream ds{in};
    69         ds.Xor(xor_key);
    
  90. fanquake commented at 10:07 am on November 12, 2024: member
    I haven’t really looked at the changes here, but just looking at the diff (+96'000 lines), my feedback would be that you’ll need to change approach in regards to making your data available (if the intent is to have that included), as I doubt we’ll be adding 96'000 lines to bench/xor.cpp. You could look at how we generate a header from bench/data/block413567.raw for a different approach, as including a small binary blob, and parsing it into a header at compile time is far more palatable.
  91. l0rinc commented at 3:16 pm on November 12, 2024: contributor

    Thanks @fanquake, I thought of that, can you please help me understand the constraints? Wouldn’t that require a cmake generation step from binary to header which would basically produce the exact same lines as what we have now? Would it help if I simply extracted it to a separate header file instead?

    So something like 1.28TB instead of 92GB, but I can’t seem to get my head screwed on right today.

Yeah, I’ve edited that part since; my napkin calculations were only true for the first row - in other cases we would need to multiply by the first column, like you did. But the point is that it’s a lot of data that we have to scale down.

    but seems slightly slower in my measurements

    I thought of that as well, but wanted to avoid floating point conversion (likely the reason for the slowness in your example)

  92. fanquake commented at 5:51 pm on November 12, 2024: member

    Wouldn’t that require a cmake generation step from binary to header which would basically produce the exact same lines as what we have now?

    Yes. See bench/data/block413567.raw & bench/data/block413567.raw.h, where at build time a header file of ~125'000 lines is produced.

    Would it help if I simply extracted it to a separate header file instead?

    I don’t think so. The point is more to not add 100'000s of lines of “data” to this repo, which doesn’t scale across many benchmarks, creates unusable diffs, leaves (source) files unviewable on GH etc.

  93. l0rinc force-pushed on Nov 13, 2024
  94. l0rinc commented at 3:27 pm on November 13, 2024: contributor

    I understand you were burned by endianness but I disagree that it’s worth sacrificing readability where endianness is a non-issue.

    Thanks @hodlinator for the suggestions, I tried them all, but in the end decided that I value consistency more than coming up with a separate solution for each test case. These are ugly, I agree, but at least they’re testing the setup we’re using in prod. I did however change the hard-coded 8 values to sizeof xor_key (for memcpy) or sizeof(uint64_t) for vector inits.

    The point is more to not add 100'000s of lines of “data” to this repo

In the end I stored the sorted diffs (since the lines are correlated, i.e. more likely to have similar neighbours) and compressed them using .tar.gz (added the generator python script as a gist, please verify) - this way the histogram data is ~100 kb instead of 1.7 MB (thanks for the hint @fanquake). I’ve extended GenerateHeaderFromRaw.cmake with compression support (adjusting GenerateHeaders.cmake to trim the suffix from the header name) and added more safety asserts to make sure the data read back is the same as before.

  95. l0rinc force-pushed on Dec 2, 2024
  96. in src/streams.h:43 in f2fd1f7c04 outdated
    41+}
    42+inline void Xor(Span<std::byte> write, const uint64_t key)
    43+{
    44+    assert(key);
    45+    for (constexpr auto size = sizeof(uint64_t); write.size() >= size; write = write.subspan(size)) {
    46+        XorInt(write, key, size);
    


    hodlinator commented at 2:51 pm on December 3, 2024:

    nit: Clearer name and source of sizeof expression.

    0    for (constexpr auto chunk_size = sizeof(key); write.size() >= chunk_size; write = write.subspan(chunk_size)) {
    1        XorInt(write, key, chunk_size);
    

    l0rinc commented at 5:31 pm on December 6, 2024:
    Not convinced chunk_size is better - but will use sizeof(key) on next push

    l0rinc commented at 5:36 pm on December 6, 2024:
    Done, kept the name
  97. in src/bench/xor.cpp:1075 in f2fd1f7c04 outdated
    108+    }
    109 }
    110 
    111-BENCHMARK(Xor, benchmark::PriorityLevel::HIGH);
    112+BENCHMARK(XorHistogram, benchmark::PriorityLevel::LOW);
    113+BENCHMARK(AutoFileXor, benchmark::PriorityLevel::LOW);
    


    hodlinator commented at 3:00 pm on December 3, 2024:
    Missing newline at EOF in latest version.

    l0rinc commented at 5:31 pm on December 6, 2024:
    I still don’t see the reason for keeping them, but I’ll add it back on next push if it’s important

    l0rinc commented at 5:36 pm on December 6, 2024:
    Done
  98. in src/bench/xor.cpp:34 in f2fd1f7c04 outdated
    33+}
    34+
    35+static void XorHistogram(benchmark::Bench& bench)
    36+{
    37+    // The histogram represents util::Xor method's write.size() histograms for the first 860k blocks
    38+    // aggregated and encoded with https://gist.github.com/l0rinc/a44da845ad32ec89c30525507cdd28ee
    


    hodlinator commented at 3:28 pm on December 3, 2024:

    nits: Although there is precedent for adding references to gists, I’m not sure we should encourage it. Would have preferred a file in this repo’s contrib/ directory.

Would also have preferred that it did the full calculation of the .tgz file instead of having a hard-coded array, computing it by looking at linearized blocks on disk.


    l0rinc commented at 5:32 pm on December 6, 2024:
    Removed
  99. hodlinator commented at 4:04 pm on December 3, 2024: contributor

    Code reviewed f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5

    0₿ git range-diff master 57caa96 f2fd1f7
    1₿ git show e314bb7e00 > old
    2₿ git show 91a8fde051 > new
    3₿ meld old new
    

    Thanks for using more constexpr std::arrays and clearer sizeofs!

    Nice that block data could be compressed to such a large extent.

nit: Would prefer the *.cmake changes were broken out into their own commit, keeping only src/bench/CMakeLists.txt as part of the benchmark change.

  100. ryanofsky commented at 5:00 pm on December 3, 2024: contributor

Concept ACK, but curious for more feedback from @maflcko about this PR. The actual code changes here do not seem too complicated but maybe they make the code less generic. I wonder if you think there are concrete downsides to this PR, or if the changes are ok but possibly not worth the review effort (as #31144 (comment) seems to suggest)

    I’m happy to spend time reviewing this if it improves performance and doesn’t cause other problems.

    this way the histogram data is ~100 kb instead of 1.7 MB

Current approach seems ok to me, but wondering if it might be better to just use a sampling of the most common write sizes instead of including the entire histogram. It seems like if you take the top 50 sizes it covers 99.6% of the writes, and might make the test more maintainable and the PR easier to understand without changing results too much.

    using histogram from https://gist.github.com/l0rinc/a44da845ad32ec89c30525507cdd28ee

     0    cut = 50
     1    hist_count = rest_count = 0
     2    histogram.sort(key=lambda h: (-h[1]*h[0]))
     3    for i, (size, count) in enumerate(histogram):
     4        if i < cut:
     5           print(f"{size=}, {count=}")
     6           hist_count += count
     7        else:
     8           rest_count += count
     9
    10    print()
    11    print(f"{hist_count=} {hist_count/(hist_count+rest_count)*100:.1f}%")
    12    print(f"{rest_count=} {rest_count/(hist_count+rest_count)*100:.1f}%")
    
     0size=32, count=5369404406
     1size=106, count=1193555153
     2size=71, count=1763349816
     3size=107, count=1064027363
     4size=25, count=2705183236
     5size=33, count=2025696499
     6size=4, count=14983070199
     7size=72, count=827712501
     8size=23, count=2357347372
     9size=1, count=47584838861
    10size=8, count=5939128629
    11size=21, count=2019807250
    12size=22, count=1616301504
    13size=34, count=682781008
    14size=139, count=155158814
    15size=253, count=83729340
    16size=105, count=198656556
    17size=64, count=275253628
    18size=4096, count=3905157
    19size=138, count=112918349
    20size=252, count=48105681
    21size=254, count=38940977
    22size=35, count=269339238
    23size=218, count=25918921
    24size=140, count=37826769
    25size=83, count=53818260
    26size=217, count=13625647
    27size=219, count=12992432
    28size=108, count=24662023
    29size=488, count=4909014
    30size=27, count=87057844
    31size=489, count=4483016
    32size=28, count=78184146
    33size=40, count=50600412
    34size=113, count=17829712
    35size=65, count=29197009
    36size=128, count=13323530
    37size=123, count=12986923
    38size=37, count=41467794
    39size=114, count=9826401
    40size=26, count=35870265
    41size=130, count=7111451
    42size=29, count=30858682
    43size=487, count=1836899
    44size=70, count=12772097
    45size=131, count=6398914
    46size=109, count=7506161
    47size=490, count=1380253
    48size=95901, count=6384
    49size=30, count=19197417
    50
    51hist_count=91959859913 99.6%
    52rest_count=382932220 0.4%
    
  101. maflcko commented at 5:09 pm on December 3, 2024: member

    Concept ACK, but curious for more feedback from @maflcko about this PR. The actual code changes here do not seem too complicated but maybe they make the code less generic.

There is a good chance that simply increasing the size of the vector would be insufficient if there is ever a need for a key of more than 8 bytes, so a complete rewrite may be needed in that case anyway. However, this is just my guess and only time will tell. So I’d say this change is probably fine for now.

Would still be nice if there were a way to take all of it out of the hot path (possibly with higher overall benefits), but I don’t know if such a change is possible and would replace this pull.

  102. l0rinc commented at 5:44 pm on December 3, 2024: contributor

Would still be nice if there were a way to take all of it out of the hot path

Can you give me hints on how to do that? Since the key is now a primitive, we already skip XOR for 0 key values, see https://github.com/bitcoin/bitcoin/pull/31144/files#diff-4020c723bb55e114bdc7ff769086a765dcc7ccfb61da2047a315db16c0c7a8b4R295

    but wondering it it might be better to just use a sampling of the most common write sizes @fanquake mentioned that he thinks this benchmark could be useful - if he’s fine with the truncated version as well, I’ll simplify (would solve some of @hodlinator’s cmake concerns as well).

  103. maflcko commented at 6:33 pm on December 3, 2024: member

    It seems like if you take the top 50 sizes it covers 99.6% of the writes

I had the same thought. There could be a problem if the remaining 0.4% of writes accounted for the majority of the time, but that seems unlikely. Other than that, taking only the top N seems preferable.

  104. l0rinc commented at 9:38 am on December 5, 2024: contributor

Would still be nice if there were a way to take all of it out of the hot path

Since blocks are XOR-ed as well, I can’t meaningfully test this with a reindex(-chainstate), so I did 2 full IBDs to 800k blocks, rebased on top of #30039, with -blocksxor=0 to test whether we can now disable XOR completely.

    0hyperfine \
    1--runs 2 \
    2--export-json /mnt/my_storage/IBD-xor-rebased.json \
    3--parameter-list COMMIT e1074081c9f1895a4f629dfee347ceae484a10d3,f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 \
    4--prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
    5'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0'
    
    0Benchmark 1: COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0
    1  Time (mean ± σ):     25797.921 s ± 61.629 s    [User: 26803.189 s, System: 1457.936 s]
    2  Range (min … max):   25754.343 s … 25841.500 s    2 runs
    3 
    4Benchmark 2: COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0
    5  Time (mean ± σ):     23751.046 s ± 342.376 s    [User: 25322.345 s, System: 1509.236 s]
    6  Range (min … max):   23508.949 s … 23993.142 s    2 runs
    

    Which indicates a 9% speedup compared to the baseline:

    0Summary
    1  COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0 ran
    2    1.09 ± 0.02 times faster than COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -blocksxor=0 -dbcache=10000 -printtoconsole=0
    
  105. maflcko commented at 9:46 am on December 5, 2024: member

Would still be nice if there were a way to take all of it out of the hot path (possibly with higher overall benefits), but I don’t know if such a change is possible and would replace this pull.

    Actually, this change here also affects RPC performance, not just internal validation, so this can be done in a follow-up or separate pull, if it is possible at all.

  106. l0rinc commented at 1:48 pm on December 6, 2024: contributor

More context: The previous benchmark was for completely turning off XOR - but we can only do that for new IBDs by explicitly setting it to 0. But in the majority of cases we likely still want to do the XOR, so this PR is meant to speed it up. I have remeasured it with a full IBD to 800k blocks (two runs to measure stability, since a reindex wouldn’t cover all usages of XOR):

    0hyperfine \
    1--runs 2 \
    2--export-json /mnt/my_storage/IBD-xor-rebased.json \
    3--parameter-list COMMIT e1074081c9f1895a4f629dfee347ceae484a10d3,f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 \
    4--prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
    5'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0'
    
    0Benchmark 1: COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0
    1  Time (mean ± σ):     25601.461 s ± 65.686 s    [User: 27025.116 s, System: 1586.908 s]
    2  Range (min … max):   25555.014 s … 25647.907 s    2 runs
    3 
    4Benchmark 2: COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0
    5  Time (mean ± σ):     24526.781 s ± 389.029 s    [User: 25525.801 s, System: 1552.625 s]
    6  Range (min … max):   24251.697 s … 24801.866 s    2 runs
    

    Which indicates that this will speed up IBD by roughly 4% on average (now that 30039 was merged the difference is more obvious):

    0Summary
    1 COMMIT=f2fd1f7c043a2782cb2bf3c9fe7e2f94c17728b5 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0 ran
    2   1.04 ± 0.02 times faster than COMMIT=e1074081c9f1895a4f629dfee347ceae484a10d3 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbcache=10000 -printtoconsole=0
    
  107. l0rinc renamed this:
    optimization: change XOR obfuscation key from `std::vector<std::byte>{8}` to `uint64_t`
    optimization: speed up XOR by 4% (9% when disabled) by applying it in larger batches
    on Dec 6, 2024
  108. l0rinc force-pushed on Dec 6, 2024
  109. l0rinc commented at 5:30 pm on December 6, 2024: contributor

    Thanks for the reviews and hints, I’ve pushed the following changes:

• Reverted the cmake changes and histogram archives @hodlinator mentioned, and based on the hints from @ryanofsky and @maflcko I’ve kept only the top entries (by frequency, re-sorted by size), making sure that the really big write-vectors are also covered (so I kept the first 1000 instead of just the first 50). This enabled putting all the data in the source file.
• Added Assumes to each XOR call to check that we don’t make any useless calls with 0 keys - making sure we “turn off” the feature when we can.
  110. l0rinc force-pushed on Dec 6, 2024
  111. in src/streams.h:58 in b9c847fd09 outdated
    68+
    69+inline uint64_t RotateKey(const uint64_t key, const size_t key_offset)
    70+{
    71+    Assume(key);
    72+    size_t key_rotation{8 * key_offset};
    73+    if (key_rotation % 64 == 0) return key;
    


    maflcko commented at 1:37 pm on December 7, 2024:
    std::rotr can handle it fine, so this can be removed. Also the codegen is better without this for some reason?

    l0rinc commented at 2:12 pm on December 10, 2024:
    yes, it’s an optimization to avoid doing any rotation when it would wrap around - it would work without this as well. It’s not a measurable speedup, though, so I can remove it if you insist.
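
    For reference, a minimal sketch of the std::rotr-only version being discussed (illustrative, not the PR’s exact code; byte-offset semantics assumed): std::rotr reduces the shift modulo the bit width, so a rotation by a multiple of 64 is already a no-op.

    #include <bit>
    #include <cstddef>
    #include <cstdint>

    inline uint64_t RotateKey(const uint64_t key, const size_t key_offset)
    {
        // std::rotr(key, 64) == key, so full wrap-arounds need no special-casing
        return std::rotr(key, static_cast<int>(8 * (key_offset % 8)));
    }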
  112. in src/node/blockstorage.cpp:1178 in b9c847fd09 outdated
    1173@@ -1174,7 +1174,9 @@ static auto InitBlocksdirXorKey(const BlockManager::Options& opts)
    1174         };
    1175     }
    1176     LogInfo("Using obfuscation key for blocksdir *.dat files (%s): '%s'\n", fs::PathToString(opts.blocks_dir), HexStr(xor_key));
    1177-    return std::vector<std::byte>{xor_key.begin(), xor_key.end()};
    1178+    uint64_t key;
    1179+    std::memcpy(&key, xor_key.data(), sizeof key);
    


    maflcko commented at 1:20 pm on December 10, 2024:

    I wonder if it wouldn’t be better to have a class take care of construction, so that the memcpy is limited to one place. Something like:

    0class XorKey{
    1 uint64_t m_data;
    2 XorKey(std::span<std::byte, 8> sp);
    3}
    

    IIUC the compilers should still be able to treat this as an integral in codegen.


    l0rinc commented at 2:23 pm on December 10, 2024:

    You mean like a

    0union {
    1    uint64_t as_uint64; 
    2    std::array<std::byte, 8> as_bytes;
    3};
    

    or just a static method that initializes the uint64_t in a single place? Or should we encapsulate the XorKey concept behind a struct as well?


    sipa commented at 2:59 pm on December 10, 2024:

    It is undefined behavior in C++ to read from a union member that wasn’t the one most recently written to.

    Using memcpy between a data type and something (equivalent to) a char/byte array is not UB.


    maflcko commented at 5:49 pm on December 10, 2024:

    You mean like a

    0union {
    1    uint64_t as_uint64; 
    2    std::array<std::byte, 8> as_bytes;
    3};
    

    No, I mean just a simple struct. Have you seen my suggestion?

The benefits would be that passing the integral value around would be type-safe. Also, the endianness considerations would be fully contained in a simple struct, instead of being spread across the XOR implementation internals and all modules that use XOR. Finally, the memcpy would also be contained in a single place. Overall this could make the code smaller, or not, depending on how many users there are. However, in any case, encapsulating the assumptions around type-safety and endianness would already be worth it in my view.
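
    A minimal sketch of that wrapper, expanding the outline above (names and exact shape are illustrative, not the PR’s code):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <span>

    class XorKey
    {
        uint64_t m_data{0};

    public:
        XorKey() = default;
        explicit XorKey(std::span<const std::byte, 8> bytes)
        {
            // the only memcpy (and endianness assumption) would live here
            std::memcpy(&m_data, bytes.data(), sizeof(m_data)); // well-defined, unlike a union read
        }
        uint64_t Value() const { return m_data; } // still a plain integral for codegen
    };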

  113. in src/test/streams_tests.cpp:17 in e1074081c9 outdated
    13@@ -14,13 +14,73 @@ using namespace std::string_literals;
    14 
    15 BOOST_FIXTURE_TEST_SUITE(streams_tests, BasicTestingSetup)
    16 
    17+BOOST_AUTO_TEST_CASE(xor_roundtrip_random_chunks)
    


    ryanofsky commented at 3:49 pm on December 10, 2024:

    In commit “test: Compare util::Xor with randomized inputs against simple impl” (e1074081c9f1895a4f629dfee347ceae484a10d3)

Would be good to add a comment explaining the test. The test seems to be encoding and then decoding random byte vectors with random 8-byte xor keys, using differently-sized and differently-aligned random chunks for encoding and decoding, and then making sure the byte vectors are unchanged after the round trip.
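
    The property being checked boils down to XOR being an involution: applying the same key stream twice restores the data regardless of chunking, as long as each chunk’s key offset matches its absolute position. As a hedged sketch (apply_random_xor_chunks is the PR’s helper; the two passes shown here are illustrative, not the test’s exact structure):

    std::vector<std::byte> roundtrip{original};
    apply_random_xor_chunks(roundtrip, key_bytes, rng); // encode with one random chunking
    apply_random_xor_chunks(roundtrip, key_bytes, rng); // decode with an independent chunking
    BOOST_CHECK(original == roundtrip);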

  114. in src/test/streams_tests.cpp:65 in e1074081c9 outdated
    42+        apply_random_xor_chunks(roundtrip, key_bytes, rng);
    43+        BOOST_CHECK(original == roundtrip);
    44+    }
    45+}
    46+
    47+BOOST_AUTO_TEST_CASE(xor_bytes_reference)
    


    ryanofsky commented at 3:56 pm on December 10, 2024:

    In commit “test: Compare util::Xor with randomized inputs against simple impl” (e1074081c9f1895a4f629dfee347ceae484a10d3)

Again, adding a test comment could be helpful here. This test is making sure the util::Xor function returns the same results as a naive byte-by-byte XOR with an 8-byte key, using 100 randomly sized random byte vectors.

    Would suggest moving xor_bytes_reference test up before the xor_roundtrip_random_chunks since it seems like a simpler test that’s an easier introduction to this code and could be followed by more complicated tests.

  115. in src/bench/xor.cpp:17 in caafbd0692 outdated
    16+static void XorHistogram(benchmark::Bench& bench)
    17 {
    18-    FastRandomContext frc{/*fDeterministic=*/true};
    19-    auto data{frc.randbytes<std::byte>(1024)};
    20-    auto key{frc.randbytes<std::byte>(31)};
    21+    // The top util::Xor method's [write.size(), frequency] calls for the IBD of the first 860k blocks
    


    ryanofsky commented at 5:07 pm on December 10, 2024:

    re: #31144 (comment)

    I think instead of taking the top 1000 calls by count, it would make sense to take the top X calls by (size*count) as suggested #31144 (comment), where X could be smaller than 1000, because (size*count) should more closely approximate the time spent on all writes of a given size than count ignoring size. This should make the test more realistic and also allow shrinking the histogram.

  116. in src/test/streams_tests.cpp:61 in e1074081c9 outdated
    56+    for (size_t test{0}; test < 100; ++test) {
    57+        const size_t write_size{1 + rng.randrange(100U)};
    58+        const size_t key_offset{rng.randrange(3 * 8U)}; // Should wrap around
    59+
    60+        std::vector key_bytes{rng.randbytes<std::byte>(sizeof(uint64_t))};
    61+        uint64_t key;
    


    ryanofsky commented at 11:00 pm on December 10, 2024:

    In commit “test: Compare util::Xor with randomized inputs against simple impl” (e1074081c9f1895a4f629dfee347ceae484a10d3)

    Would be good to use consistent variable names in these tests. The other tests are calling key vectors xor_pat and calling key values xor_key, while these tests are calling key vectors key_bytes and calling key values key. Would be clearer to use consistent names.

    Also, after this PR each test is keeping two different variables and two different representations for each key. It would be good clean this up afterwards and just have one variable per key. Using a dedicated type for keys like the XorKey struct Marco suggested #31144 (review) would be even more ideal.

  117. in src/bench/xor.cpp:15 in caafbd0692 outdated
    11 #include <cstddef>
    12+#include <map>
    13 #include <vector>
    14 
    15-static void Xor(benchmark::Bench& bench)
    16+static void XorHistogram(benchmark::Bench& bench)
    


    ryanofsky commented at 11:23 pm on December 10, 2024:

    In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)

Would be helpful to have a comment saying what this benchmark is measuring. Maybe: // Measure speed of util::Xor function applied to a set of byte vectors. The byte vectors are filled with random data and have sizes matching a distribution of data write sizes observed during IBD.

  118. in src/bench/xor.cpp:1041 in caafbd0692 outdated
    1040+            test_data.emplace_back(rand_bytes);
    1041+        }
    1042+    }
    1043+    assert(total_bytes == 114'929'502);
    1044+
    1045+    std::ranges::shuffle(test_data, rng); // Make it more realistic & less predictable
    


    ryanofsky commented at 11:48 pm on December 10, 2024:

    In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)

    It seems awkward to have code with all these hardcoded values that will make it hard to update the histogram, and to end up with test data we can’t directly control the size of. Would be cleaner to just choose how much data to generate, and generate it without hardcoding values from the histogram. Would suggest something more like the following, with an easily adjustable size that doesn’t hardcode other values.

     0std::vector<std::vector<std::byte>> test_data(1'000'000);
     1FastRandomContext rng{/*fDeterministic=*/true};
     2
     3std::vector<std::pair<uint64_t, size_t>> cumulative_count;
     4uint64_t total_count = 0;
     5for (const auto& [size, count] : histogram) {
     6    total_count += count;
     7    cumulative_count.emplace_back(total_count, size);
     8}
     9
    10auto sample_size = [&]() -> size_t {
11    uint64_t rand_val = rng.randrange(total_count);
12    return std::upper_bound(
13        cumulative_count.begin(), cumulative_count.end(), rand_val,
14        [](uint64_t value, const std::pair<uint64_t, size_t>& entry) {
15            return value < entry.first;
    16        })->second;
    17};
    18
    19uint64_t total_bytes{0};
    20for (auto& entry : test_data) {
    21    entry = rng.randbytes<std::byte>(sample_size());
    22    total_bytes += entry.size();
    23}
    24
    25std::ranges::shuffle(test_data, rng);
    

    (code was generated by chatgpt)

  119. in src/bench/xor.cpp:1056 in caafbd0692 outdated
    1055+        }
    1056+        ankerl::nanobench::doNotOptimizeAway(test_data);
    1057+    });
    1058+}
    1059+
    1060+static void AutoFileXor(benchmark::Bench& bench)
    


    ryanofsky commented at 0:03 am on December 11, 2024:

    In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)

Not really sure I understand the goal of this benchmark. Is it significant that the xor key is 0? Would be helpful to have a description of what this benchmark is measuring and indicating.

  120. in src/bench/xor.cpp:1064 in b9c847fd09 outdated
    1063+    const auto data{rng.randbytes<std::byte>(1'000'000)};
    1064+
    1065+    const std::vector empty_key_bytes(8, std::byte{0}); // Test disabled xor
    1066+    uint64_t empty_key;
    1067+    std::memcpy(&empty_key, empty_key_bytes.data(), 8);
    1068+
    


    ryanofsky commented at 0:05 am on December 11, 2024:

    In commit “bench: Make Xor benchmark more representative” (caafbd069246848a8bdfc2f42fd1d692a824de94)

    Again this looks like a nice place to replace these dual key representations with a clean XorKey value.

  121. in src/streams.h:31 in 7a2e5ec977 outdated
    28 #include <vector>
    29+#include <util/check.h>
    30 
    31 namespace util {
    32-inline void Xor(Span<std::byte> write, Span<const std::byte> key, size_t key_offset = 0)
    33+inline void XorInt(Span<std::byte> write, const uint64_t key, const size_t size)
    


    ryanofsky commented at 0:17 am on December 11, 2024:

    In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)

    Having this size parameter is confusing and seems unnecessary. This would be clearer as:

    0inline void XorInt(Span<std::byte> write, const uint64_t key)
    1{
    2    uint64_t raw = 0;
    3    memcpy(&raw, write.data(), write.size());
    4    raw ^= key;
    5    memcpy(write.data(), &raw, write.size());
    6}
    

    Call sites would need to use .subspan(0, size) but this would also make them more obvious.

  122. in src/streams.h:33 in 7a2e5ec977 outdated
    32-inline void Xor(Span<std::byte> write, Span<const std::byte> key, size_t key_offset = 0)
    33+inline void XorInt(Span<std::byte> write, const uint64_t key, const size_t size)
    34 {
    35-    if (key.size() == 0) {
    36-        return;
    37+    Assume(key);
    


    ryanofsky commented at 0:28 am on December 11, 2024:

    In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)

    Seems like it would be less fragile and would simplify callers to replace this Assume(key) with if (!key) return;


    l0rinc commented at 12:19 pm on December 11, 2024:
    I could do that, but the Assume here was meant to make sure no calls get here when the key is 0 in the first place - since that can usually eliminate other work as well (e.g. MakeWritableByteSpan in DataStream#Xor)
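
    A sketch of that caller-side short-circuit (the DataStream::Xor body shown is illustrative): returning early on a zero key skips not just the XOR loop but the span construction too.

    void DataStream::Xor(const uint64_t key)
    {
        if (key == 0) return; // obfuscation disabled: skip MakeWritableByteSpan and the XOR loop
        util::Xor(MakeWritableByteSpan(*this), key);
    }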
  123. in src/dbwrapper.cpp:262 in 7a2e5ec977 outdated
    266-        std::vector<unsigned char> new_key = CreateObfuscateKey();
    267+    obfuscate_key = std::vector<unsigned char>(OBFUSCATE_KEY_NUM_BYTES, '\000'); // Needed for unobfuscated Read
    268+    const bool key_missing = !Read(OBFUSCATE_KEY_KEY, obfuscate_key);
    269+    if (key_missing && params.obfuscate && IsEmpty()) {
    270+        // Initialize non-degenerate obfuscation if it won't upset existing, non-obfuscated data.
    271+        const std::vector<unsigned char> new_key = CreateObfuscateKey();
    


    ryanofsky commented at 0:36 am on December 11, 2024:

    In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)

I’m confused why this code is changing in this commit, when it seems unrelated to the streams.h optimization and the behavior seems the same as before? If this is a refactoring cleanup, it would be good to move it to a separate commit explaining that it is a refactoring and what the point of these changes may be.

  124. in src/node/mempool_persist.cpp:62 in 7a2e5ec977 outdated
    57@@ -58,15 +58,16 @@ bool LoadMempool(CTxMemPool& pool, const fs::path& load_path, Chainstate& active
    58     try {
    59         uint64_t version;
    60         file >> version;
    61-        std::vector<std::byte> xor_key;
    62         if (version == MEMPOOL_DUMP_VERSION_NO_XOR_KEY) {
    63-            // Leave XOR-key empty
    64+            const std::vector xor_key(sizeof(uint64_t), std::byte{'\000'});
    


    ryanofsky commented at 0:42 am on December 11, 2024:

    In commit “optimization: Xor 64 bits together instead of byte-by-byte” (7a2e5ec97700584eeac6f8b08ef697df6a147606)

Are all the changes in this file also a refactoring that doesn’t change behavior? I don’t understand why these changes are in a commit that is supposed to be optimizing streams.h behavior. Would suggest splitting this commit up and explaining what the purpose of these changes is. Maybe they would make more sense in the next commit, so this code is not changing twice?

  125. ryanofsky commented at 1:15 am on December 11, 2024: contributor

    Code review b9c847fd093d100628817af98fe837db938160f7. These changes look good and make sense, and I reviewed almost everything but have a few pieces of feedback:

    • I would very strongly endorse Marco’s suggestion to represent keys with an XorKey struct instead of raw uint64_t values so the code getting and setting keys is simpler, safer, and more uniform, and we can avoid a proliferation of memcpy calls.
• I don’t think I understand the structure of the third and fourth commits. The third commit seems to be adding an optimization to streams.h but also refactoring code not directly related to streams.h, and then the fourth commit is refactoring a lot of the same code that was just refactored, and some of the code that was optimized as well. Would suggest doing this more cleanly in 3 commits:
  • First commit adding optimized streams.h/streams.cpp functions and wrappers to provide backwards compatibility, so no other code or tests have to change in the commit.
  • Second commit updating code and tests to call the optimized streams.h API instead of the backwards compatibility wrappers.
  • Third commit deleting the streams.h backwards compatibility wrappers.
  126. l0rinc marked this as a draft on Dec 14, 2024
  127. bench: add SaveBlockToDiskBench 3f0514f905
  128. refactor,blocks: inline `UndoWriteToDisk`
    `UndoWriteToDisk` wasn't really extracting a meaningful subset of the `WriteUndoDataForBlock` functionality, it's tied closely to the only caller (needs the header size twice, recalculated undo serializes size, returns multiple branches, modifies parameter, needs documentation).
    
    The inlined code should only differ in these parts (modernization will be done in other commits):
    * renamed `_pos` to `pos` in `WriteUndoDataForBlock` to match the parameter name;
    * inlined `hashBlock` parameter usage into `hasher << block.pprev->GetBlockHash()`;
    * changed `return false` to `return FatalError`.
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    b91d55cf26
  129. refactor,blocks: inline `WriteBlockToDisk`
    `WriteBlockToDisk` wasn't really extracting a meaningful subset of the `SaveBlockToDisk` functionality, it's tied closely to the only caller (needs the header size twice, recalculated block serializes size, returns multiple branches, mutates parameter).
    
    The inlined code should only differ in these parts (modernization will be done in other commits):
    * renamed `blockPos` to `pos` in `SaveBlockToDisk` to match the parameter name;
    * changed `return false` to `return FlatFilePos()`.
    
    Also removed remaining references to `SaveBlockToDisk`.
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    df77c0d704
  130. refactor,blocks: cache block serialized size for consecutive calls
    For consistency `UNDO_DATA_DISK_OVERHEAD` was also extracted to avoid the constant's ambiguity.
    4b8226b5b7
  131. refactor,blocks: remove costly asserts
When the behavior was changed in a previous commit (caching `GetSerializeSize` and avoiding `AutoFile.tell`), asserts were added to make sure the behavior was kept - so that reviewers and CI could validate it.
    We can safely remove them now.
    
    Co-authored-by: Anthony Towns <aj@erisian.com.au>
    42d157d596
  132. scripted-diff: rename block and undo functions for consistency
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    
    -BEGIN VERIFY SCRIPT-
    sed -i \
        -e 's/\bSaveBlockToDisk\b/SaveBlock/g' \
        -e 's/\bWriteUndoDataForBlock\b/SaveBlockUndo/g' \
        $(git ls-files)
    -END VERIFY SCRIPT-
    92abdde7bb
  133. optimization: Bulk serialization reads in `UndoReadFromDisk` and `ReadBlockFromDisk`
The obfuscation (XOR) operations are currently done byte-by-byte during serialization; buffering the reads will enable batching the obfuscation operations later (not yet done here).
    
    Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing those from memory) is also a bit faster (likely because of fewer native fread calls or less locking).
    
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,285,753.56 |              437.49 |    0.3% |     11.03 | `ReadBlockFromDiskBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,804,240.57 |              554.25 |    0.2% |     11.01 | `ReadBlockFromDiskBench`
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    34eafa54a6
  134. optimization: Bulk serialization writes in `SaveBlockUndo` and `SaveBlock`
Similarly to the serialization reads, buffered writes will enable batched XOR calculations - especially since we currently need to copy the written input Span to do the obfuscation on it; batching enables doing the XOR on the internal buffer instead.
    
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        5,244,636.51 |              190.67 |    0.3% |     11.04 | `SaveBlockToDiskBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,787,371.46 |              559.48 |    1.6% |     11.16 | `SaveBlockToDiskBench`
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    3ca3e59501
  135. test: Compare util::Xor with randomized inputs against simple impl
    Since production code only uses keys of length 8, we're not testing with other values anymore
    ae75c72db5
  136. l0rinc force-pushed on Dec 21, 2024
  137. bench: Make Xor benchmark more representative
To make the benchmarks representative, I've collected the write-vector sizes during IBD for every invocation of `util::Xor` until 860k blocks, and used them as the basis for the micro-benchmarks, generating random data with a similar distribution (taking the 1000 most frequent sizes, making sure the very big ones are also covered).
    
And even though we already have serialization tests, `AutoFileXor` was added to serialize 1 MB via the provided key_bytes.
    This was used to test the effect of disabling obfuscation.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                1.07 |      937,527,289.88 |    0.4% |     10.24 | `AutoFileXor`
    |                0.87 |    1,149,859,017.49 |    0.3% |     10.80 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                1.87 |      535,253,389.72 |    0.0% |            9.20 |            3.45 |  2.669 |           1.03 |    0.1% |     11.02 | `AutoFileXor`
    |                1.70 |      587,844,715.57 |    0.0% |            9.35 |            5.41 |  1.729 |           1.05 |    1.7% |     10.95 | `XorHistogram`
    df516ee5e3
  138. DrahtBot added the label CI failed on Dec 21, 2024
  139. DrahtBot commented at 5:07 pm on December 21, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/34748857550

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  140. l0rinc force-pushed on Dec 21, 2024
  141. l0rinc renamed this:
    optimization: speed up XOR by 4% (9% when disabled) by applying it in larger batches
    optimization: batch XOR operations 12% faster IBD
    on Dec 22, 2024
  142. l0rinc commented at 11:42 am on December 22, 2024: contributor

    The PR has been split into 3 to simplify review, please check those out first:

The 3 changes together achieve a 12.3% speedup in raw IBD (real nodes, multiple runs, 870k blocks, 1 GB dbcache, SSD)

     0hyperfine \
     1--runs 2 \
     2--parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,fe7365584bb3703e5691c93fb004772e84db3697 \
     3--prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
     4'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -dbcache=1000 -printtoconsole=0'                                                          
     5  
     6Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -dbcache=1000 -printtoconsole=0
     7  Time (mean ± σ):     34098.690 s ± 175.888 s    [User: 51918.900 s, System: 2898.126 s]
     8  Range (min … max):   33974.318 s … 34223.062 s    2 runs
     9  
    10Benchmark 2: COMMIT=fe7365584bb3703e5691c93fb004772e84db3697 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -dbcache=1000 -printtoconsole=0
    11  Time (mean ± σ):     30359.460 s ± 417.536 s    [User: 48322.766 s, System: 2898.793 s]
    12  Range (min … max):   30064.218 s … 30654.703 s    2 runs
    
  143. l0rinc force-pushed on Dec 22, 2024
  144. l0rinc force-pushed on Dec 23, 2024
  145. l0rinc force-pushed on Dec 23, 2024
  146. optimization: Xor 64 bits together instead of byte-by-byte
    `util::Xor` method was split out into more focused parts:
* one which assumes that the `uint64_t` key is properly aligned, doing the first few XORs as 64 bits (the memcpy is eliminated by most compilers), and the last iteration is optimized for 8/16/32 bits.
    * an unaligned `uint64_t` key with a `key_offset` parameter which is rotated to accommodate the data (adjusting for endianness).
    * a legacy `std::vector<std::byte>` key with an asserted 8 byte size, converted to `uint64_t`.
    
    Note that the default statement alone would pass the tests, but would be very slow, since the 1, 2 and 4 byte versions won't be specialized by the compiler, hence the switch.
    
    Asserts were added throughout the code to make sure every such vector has length 8, since in the next commit we're converting all of them to `uint64_t`.
    
    refactor: Migrate fixed-size obfuscation end-to-end from `std::vector<std::byte>` to `uint64_t`
    
    Since `util::Xor` accepts `uint64_t` values, we're eliminating any repeated vector-to-uint64_t conversions going back to the loading/saving of these values (we're still serializing them as vectors, but converting as soon as possible to `uint64_t`). This is the reason the tests still generate vector values and convert to `uint64_t` later instead of generating it directly.
    
    We're also short-circuit `Xor` calls with 0 key values early to avoid unnecessary calculations (e.g. `MakeWritableByteSpan`) - even assuming that XOR is never called for 0.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                0.09 |   10,799,585,470.46 |    1.3% |     11.00 | `AutoFileXor`
    |                0.14 |    7,144,743,097.97 |    0.2% |     11.01 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                0.59 |    1,706,433,032.76 |    0.1% |            0.00 |            0.00 |  0.620 |           0.00 |    1.8% |     11.01 | `AutoFileXor`
    |                0.47 |    2,145,375,849.71 |    0.0% |            0.95 |            1.48 |  0.642 |           0.20 |    9.6% |     10.93 | `XorHistogram`
    
    ----
    
    A few other benchmarks that seem to have improved as well (tested with Clang only):
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,237,168.64 |              446.99 |    0.3% |     10.91 | `ReadBlockFromDiskTest`
    |          748,837.59 |            1,335.40 |    0.2% |     10.68 | `ReadRawBlockFromDiskTest`
    
    After:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,827,436.12 |              547.21 |    0.7% |     10.95 | `ReadBlockFromDiskTest`
    |           49,276.48 |           20,293.66 |    0.2% |     10.99 | `ReadRawBlockFromDiskTest`
    898a07e2ab
  147. l0rinc force-pushed on Dec 24, 2024
  148. DrahtBot removed the label CI failed on Dec 24, 2024
