optimization: change XOR obfuscation key from std::vector<std::byte>{8} to uint64_t #31144

pull l0rinc wants to merge 4 commits into bitcoin:master from l0rinc:l0rinc/optimize-xor changing 16 files +264 −93
  1. l0rinc commented at 7:11 am on October 24, 2024: contributor

    The recently merged XOR obfuscation introduced a minor serialization cost (seen during IBD benchmarking). This PR is meant to alleviate that.

    Changes in testing, benchmarking and implementation

    • Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
    • Updated the benchmark to reflect realistic scenarios by capturing every call of util::Xor for the first 860k blocks, creating a model with the same data distribution. An additional benchmark checks the effect of short-circuiting XOR when the key is zero, ensuring no speed regression occurs when the obfuscation feature is disabled.
    • Optimized the Xor function to process in batches (64/32/16/8 bits instead of per-byte).
    • Migrated remaining std::vector<std::byte>(8) values to uint64_t.

    Reproducer and assembly

    Memory alignment is handled via std::memcpy, optimized out on tested platforms (see https://godbolt.org/z/dcxvh6abq):

    • Clang (x86-64) - 32 bytes/iter using SSE vector operations
    • GCC (x86-64) - 16 bytes/iter using unrolled 64-bit XORs
    • RISC-V (32-bit) - 8 bytes/iter using load/XOR/store sequence
    • s390x (big-endian) - 64 bytes/iter with unrolled 8-byte XORs

    Endianness

    The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption. Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order. Since CI lacks a big-endian machine, testing was done locally via Docker.
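The realignment can be illustrated with a hedged sketch (hypothetical `RotateKeyBytes` name): rotating the key's bytes in memory order, rather than bit-rotating the `uint64_t` (where the shift direction would depend on byte order), keeps the operation portable across endiannesses:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>

// Sketch: realign an 8-byte key so obfuscation can resume `key_offset` bytes
// into the key. Operating on the bytes in memory order (instead of bit-rotating
// the uint64_t, whose shift direction depends on endianness) stays portable.
uint64_t RotateKeyBytes(uint64_t key64, size_t key_offset)
{
    std::array<unsigned char, 8> bytes;
    std::memcpy(bytes.data(), &key64, sizeof(key64));
    std::rotate(bytes.begin(), bytes.begin() + (key_offset % bytes.size()), bytes.end());
    std::memcpy(&key64, bytes.data(), sizeof(key64));
    return key64;
}
```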

    brew install podman pigz
    softwareupdate --install-rosetta
    podman machine init
    podman machine start
    docker run --platform linux/s390x -it ubuntu:latest /bin/bash
        apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev && \
        cd /mnt && git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
        cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
        ./build/src/test/test_bitcoin --run_test=streams_tests
    

    Performance

    cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    ns/byte byte/s err% total benchmark
    1.46 684,148,318.55 0.9% 10.17 AutoFileXor
    21.57 46,357,868.35 0.3% 10.93 XorHistogram

    After:

    ns/byte byte/s err% total benchmark
    0.14 7,071,717,185.75 0.9% 11.25 AutoFileXor
    9.25 108,126,826.70 0.4% 11.00 XorHistogram

    C++ compiler …………………….. GNU 13.2.0

    Before:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    1.84 544,924,597.89 0.1% 9.20 3.49 2.639 1.03 0.1% 11.03 AutoFileXor
    35.74 27,980,106.15 0.1% 110.78 121.32 0.913 12.70 5.6% 11.70 XorHistogram

    After:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.54 1,863,647,178.39 0.0% 0.00 0.00 0.743 0.00 2.1% 11.06 AutoFileXor
    13.73 72,823,051.75 0.0% 24.45 46.61 0.525 5.96 13.4% 11.04 XorHistogram

    i.e. roughly 2-2.5x faster for a representative sample.

    The 860k block profile contains a lot of very big arrays (96'233 separate sizes, biggest was 3'992'470 bytes long) - a big departure from the previous 400k and 700k blocks (having 1500 sizes, biggest was 9319 bytes long).

    The performance characteristics are also quite different, now that we have more and bigger byte arrays:

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    ns/byte byte/s err% total benchmark
    1.04 961,221,881.72 0.2% 93.30 XorHistogram

    After:

    ns/byte byte/s err% total benchmark
    0.03 28,649,147,111.58 0.2% 6.78 XorHistogram

    i.e. ~30x faster with Clang at processing the data with representative histograms.

    C++ compiler …………………….. GNU 13.2.0

    Before:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.97 1,032,916,679.87 0.0% 9.01 3.29 2.738 1.00 0.0% 86.58 XorHistogram

    After:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.10 10,369,097,976.62 0.0% 0.32 0.33 0.985 0.06 0.6% 8.63 XorHistogram

    i.e. ~10x faster with GCC at processing the data with representative histograms.

    Before:

    ns/op op/s err% total benchmark
    3,146,553.30 317.81 0.1% 11.02 ReadBlockFromDiskTest
    1,014,134.52 986.06 0.2% 11.00 ReadRawBlockFromDiskTest

    After:

    ns/op op/s err% total benchmark
    2,522,199.92 396.48 0.1% 11.03 ReadBlockFromDiskTest
    64,484.30 15,507.65 0.2% 10.58 ReadRawBlockFromDiskTest

    Also visible on https://corecheck.dev/bitcoin/bitcoin/pulls/31144

    The obfuscation’s effect on IBD seems to be about 3%, see: #31144 (comment)

  2. DrahtBot commented at 7:11 am on October 24, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31144.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK hodlinator

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30901 (cmake: Revamp handling of data files for {test,bench}_bitcoin targets by hebasto)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
    • #28792 (Embed default ASMap as binary dump header file by fjahr)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. l0rinc force-pushed on Oct 24, 2024
  4. DrahtBot added the label CI failed on Oct 24, 2024
  5. in src/streams.h:40 in 10b9e68768 outdated
    41-    key_offset %= key.size();
    42 
    43-    for (size_t i = 0, j = key_offset; i != write.size(); i++) {
    44-        write[i] ^= key[j++];
    45+    if (size_t remaining = write.size() - i; key.size() == 8 && remaining >= 8) { // Xor in 64-bit chunks
    46+        const auto key64 = *std::bit_cast<uint64_t*>(key.data());
    


    maflcko commented at 8:27 am on October 24, 2024:
    I fail to see how this is not UB. This is identical to #30349 (review)

    maflcko commented at 8:54 am on October 24, 2024:

    Isn’t this what we’re doing in CScript as well https://github.com/bitcoin/bitcoin/blob/master/src/script/script.h#L496 ?

    no? value_type is unsigned char (an 8-bit integer type) and this one here is uint64_t (a 64-bit integer type).


    l0rinc commented at 8:55 am on October 24, 2024:
    Yes, just saw it, my mistake

    l0rinc commented at 9:17 am on October 24, 2024:
    Replaced it with memcpy and it looks like the compiler successfully simplifies it to proper vector instructions: https://godbolt.org/z/Koscjconz
  6. in src/streams.h:42 in 10b9e68768 outdated
    43-    for (size_t i = 0, j = key_offset; i != write.size(); i++) {
    44-        write[i] ^= key[j++];
    45+    if (size_t remaining = write.size() - i; key.size() == 8 && remaining >= 8) { // Xor in 64-bit chunks
    46+        const auto key64 = *std::bit_cast<uint64_t*>(key.data());
    47+        const auto size64 = remaining / 8;
    48+        for (auto& write64 : std::span(std::bit_cast<uint64_t*>(write.data() + i), size64)) {
    


    maflcko commented at 8:27 am on October 24, 2024:
    same

    maflcko commented at 8:37 am on October 24, 2024:

    CI seems to agree:

    /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_iterator.h:1100:16: runtime error: reference binding to misaligned address 0x7f10961d9084 for type 'unsigned long', which requires 8 byte alignment
    0x7f10961d9084: note: pointer points here
      94 8e 20 eb 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                  ^
        #0 0x55780d85ab85 in __gnu_cxx::__normal_iterator<unsigned long*, std::span<unsigned long, 18446744073709551615ul>>::operator*() const /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_iterator.h:1100:9
        #1 0x55780d85ab85 in util::Xor(Span<std::byte>, Span<std::byte const>, unsigned long) ci/scratch/build-x86_64-pc-linux-gnu/src/test/./src/streams.h:42:28
    

    https://github.com/bitcoin/bitcoin/actions/runs/11495063168/job/31993791381?pr=31144#step:7:4647


    l0rinc commented at 8:44 am on October 24, 2024:
    Thanks, I’ll investigate. I assumed there would be more to check; that’s why it’s still a draft.
  7. maflcko commented at 8:27 am on October 24, 2024: member

    I think your example may be a bit skewed? It shows how much time is spent when deserializing a CScript from a block file. However, block files contain full blocks, where many (most?) of the writes are single bytes (or 4 bytes), see #30833 (comment). Thus, it would be useful to know what the overall end-to-end performance difference is. Also taking into account the utxo db.

    If you want the micro-benchmark to be representative, I’d presume you’d have to mimic the histogram of the sizes of writes. Just picking one (1024, or 1004), which is never hit in reality, and then optimizing for that may be misleading.
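The suggestion above (mimicking the histogram of write sizes rather than benchmarking one fixed size) could be sketched like this; the buckets and weights are made-up placeholders, not the measured distribution:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sketch: sample write sizes from a weighted distribution instead of one fixed
// size, so a micro-benchmark exercises a mix of small and large writes similar
// to real block data. The buckets/weights below are illustrative only.
std::vector<size_t> SampleWriteSizes(size_t count)
{
    const std::vector<size_t> sizes{1, 4, 8, 32, 1024};  // hypothetical size buckets
    const std::vector<double> weights{50, 30, 10, 8, 2}; // hypothetical frequencies
    std::mt19937 rng{42}; // fixed seed for reproducible benchmark inputs
    std::discrete_distribution<size_t> bucket(weights.begin(), weights.end());
    std::vector<size_t> out;
    out.reserve(count);
    for (size_t i{0}; i < count; ++i) out.push_back(sizes[bucket(rng)]);
    return out;
}
```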

  8. l0rinc force-pushed on Oct 24, 2024
  9. l0rinc force-pushed on Oct 24, 2024
  10. l0rinc commented at 10:02 am on October 24, 2024: contributor

    where many (most?) of the writes are single bytes (or 4 bytes)

    Thanks, I’ve extended your previous benchmarks with both Autofile serialization and very small vectors. I will also run a reindex of 400k blocks before and after to see if the effect is measurable or not.

  11. in src/streams.h:44 in 6ae466bf11 outdated
    42+    if (key.size() == 8 && write.size() - i >= 8) { // Xor in 64-bit chunks
    43+        uint64_t key64;
    44+        std::memcpy(&key64, key.data(), 8);
    45+        for (; i <= write.size() - 8; i += 8) {
    46+            uint64_t write64;
    47+            std::memcpy(&write64, write.data() + i, 8);
    


    laanwj commented at 10:41 am on October 24, 2024:

    I have a hard time believing this will make a large difference, especially with the two memcpys involved. On modern CPUs, ALU operations (especially bitwise ones) are very fast compared to any kind of memory access. And this isn’t some advanced crypto math, it’s one XOR operation per word with a fixed key.

    Could avoid the memcpys if the code takes memory alignment into account, but that makes it even more complex. Not sure the pros/cons work out here.


    l0rinc commented at 10:46 am on October 24, 2024:
    The speedup comes from the vectorized operations, i.e. doing 64 bit xor instead of byte-by-byte xor (memcpy seems to be eliminated on 64 bit architectures successfully), see https://godbolt.org/z/Koscjconz

    laanwj commented at 11:05 am on October 24, 2024:
    Right, that’s trivial for x86_64. Let’s also check for architectures that do require alignment for 64-bit reads and writes, like RISC-V.

    l0rinc commented at 12:06 pm on October 24, 2024:
    Added RISC-V compiler to https://godbolt.org/z/n5rMeYeas - where it seems to my untrained eyes that it uses two separate 32 bit xors to emulate the 64 bit operation (but even if it’s byte-by-byte on 32 bit processors, that’s still the same as what it was before on 64 bit CPUs, right?).

    l0rinc commented at 10:01 pm on October 26, 2024:

    I’ve replaced all vector keys with 64 bit ints

    Edit: Memory alignment is handled via std::memcpy, optimized out on tested platforms (see godbolt.org/z/dcxvh6abq):

    • Clang (x86-64) - 32 bytes/iter using SSE vector operations
    • GCC (x86-64) - 16 bytes/iter using unrolled 64-bit XORs
    • RISC-V (32-bit) - 8 bytes/iter using load/XOR/store sequence
    • s390x (big-endian) - 64 bytes/iter with unrolled 8-byte XORs

    (please validate, my assembly knowledge is mostly academic)

  12. in src/streams.h:51 in 6ae466bf11 outdated
    59-        // way instead of doing a %, which would effectively be a division
    60-        // for each byte Xor'd -- much slower than need be.
    61-        if (j == key.size())
    62-            j = 0;
    63+    for (size_t j = 0; i < write.size(); ++i, ++j) {
    64+        write[i] ^= key[j % key.size()];
    


    laanwj commented at 10:42 am on October 24, 2024:
    Please avoid % (especially with a dynamic value) in an inner loop. It’s essentially a division operation and those are not cheap.

    l0rinc commented at 10:45 am on October 24, 2024:
    Thanks for the hint, I deliberately removed that (please check the commit messages for details), since these are optimized away. Also, this is just the leftover part, so for key of length 8 (the version used in most places) this will have 7 iterations at most. Can you see any difference with any of the benchmarks?

    laanwj commented at 11:02 am on October 24, 2024:

    It’s only up to 7 iterations (assuming the key size is 8), sure, you’re right.

    But ok, yeah, I’m a bit divided about relying on specific non-trivial things being optimized out; it makes the output very dependent on specific compiler decisions (which may be fickle in some cases).


    l0rinc commented at 11:05 am on October 24, 2024:
    Often the simplest code gets optimized most, since it’s more predictable. Would you like me to extend the test or benchmark suite or try something else to make sure we’re comfortable with the change?

    laanwj commented at 11:10 am on October 24, 2024:
    No, it’s fine. It just gives me goosebumps seeing code like this, but if it doesn’t affect the benchmark and no one else cares then it doesn’t matter.

    l0rinc commented at 12:09 pm on October 24, 2024:
    To get rid of the goosebumps I’m handling the remaining 4 bytes as a single 32 bit xor now, so the final loop (when keys are 8 bytes long, which is mostly the case for us, I think) does 3 iterations at most. So even if it’s not optimized away, we should be fine doing 3 divisions by a nice round number like 8.

    maflcko commented at 12:17 pm on October 24, 2024:

    divisions by a nice round number like 8.

    I don’t think the compiler knows the number here, so can’t use it to optimize the code based on it.


    l0rinc commented at 12:19 pm on October 24, 2024:
    Is that important for max 3 divisions?

    maflcko commented at 7:33 pm on October 24, 2024:

    Yes. At least for me locally, but I am getting widely different bench results anyway: #31144 (comment)

    With this one reverted, XorSmall is back to being just slightly slower than current master for me.


    l0rinc commented at 8:37 pm on October 24, 2024:

    Usually these optimizations concentrate on the measurable parts based on the profiling results that I’m getting during reindexing or IBD. Obfuscating a single bit (i.e. XorSmall) wasn’t my focus, it’s already very fast, didn’t seem like the bottleneck. Would you like me to concentrate on that scenario instead? Or would it make more sense to serialize a block and use that as the basis for the benchmarks?

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    ns/byte byte/s err% total benchmark
    1.99 501,740,412.05 0.5% 10.27 AutoFileXor
    1.24 807,597,761.92 0.0% 11.01 Xor
    1.29 776,238,564.27 0.0% 10.59 XorSmall

    After:

    ns/byte byte/s err% total benchmark
    0.73 1,364,622,484.81 0.9% 8.78 AutoFileXor
    0.02 40,999,383,920.82 0.0% 11.00 Xor
    0.54 1,862,525,472.57 0.0% 11.00 XorSmall

    C++ compiler …………………….. GNU 12.2.0

    Before:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    1.60 624,562,742.35 0.2% 9.20 3.57 2.579 1.03 0.1% 10.61 AutoFileXor
    0.91 1,102,205,071.31 0.0% 9.02 3.26 2.763 1.00 0.1% 11.00 Xor
    1.23 811,876,565.33 0.1% 14.60 4.43 3.295 1.80 0.0% 10.56 XorSmall

    After:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    0.74 1,346,798,809.87 0.1% 0.72 0.47 1.531 0.16 0.2% 11.02 AutoFileXor
    0.09 11,450,472,586.50 0.0% 0.59 0.31 1.882 0.14 1.9% 10.82 Xor
    5.65 177,092,223.60 0.1% 22.00 20.31 1.083 4.80 0.0% 10.99 XorSmall

    maflcko commented at 9:11 pm on October 24, 2024:

    Would you like me to concentrate on that scenario instead? Or would it make more sense to serialize a block and use that as the basis for the benchmarks?

    Well no. I think this has been mentioned previously. Generally, optimizing for micro benchmarks may not yield results that are actually meaningful or visible for end-users, because the benchmarks capture only a very specific and narrow view. Optimizing for one could even make the code slower for another (as observed above). Adding a bench for the block couldn’t hurt, but I haven’t checked how representative it is. If such a bench represents the IBD behavior, it would be ideal. (There already is a block in the hex bench data, which could be used)

    Usually these optimizations concentrate on the measurable parts based on the profiling results that I’m getting during reindexing or IBD

    Yes, that is more useful. It would be good to share the number you got. Because the commit message simply claims that no benefit was found (“The if (j == key.size()) optimization wasn’t kept since the benchmarks couldn’t show any advantage anymore”).

    XorSmall

    Looks like you can reproduce the slowdown. I wonder if it is correlated with the use of libc++ vs libstdc++


    l0rinc commented at 11:35 am on October 25, 2024:

    It would be good to share the number you got

    The reindex-chainstate until 600k, 2 runs just finished - comparing master against the 64/32 bit packing (current state) on Linux (with GCC, showing the above inconsistency).

    hyperfine \
    --runs 2 \
    --show-output \
    --export-json /mnt/my_storage/reindex-xor.json \
    --parameter-list COMMIT dea7e2faf1bc48f96741ef84e25e6f47cefd5a92,353915bae14b9704a209bc09b021d3dd2ee11cf2 \
    --prepare 'cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
    'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0'
    

    Before:

    • 3.554 hours
    • 3.567 hours

    After:

    • 3.523 hours
    • 3.527 hours
    Benchmark 1: COMMIT=dea7e2faf1bc48f96741ef84e25e6f47cefd5a92 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0
      Time (mean ± σ):     12819.367 s ± 35.155 s    [User: 11992.168 s, System: 2509.200 s]
      Range (min … max):   12794.508 s … 12844.225 s    2 runs

    Benchmark 2: COMMIT=353915bae14b9704a209bc09b021d3dd2ee11cf2 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=500 -printtoconsole=0 -reindex-chainstate -connect=0
      Time (mean ± σ):     12685.350 s ± 19.878 s    [User: 11918.349 s, System: 2523.819 s]
      Range (min … max):   12671.295 s … 12699.406 s    2 runs
    

    Reindexing is a lot more stable than IBD (as seen from multiple measurements), showing a consistent 1% speedup. Not earth-shattering, but at least this way the obfuscation isn’t causing a speed regression anymore.


    Adding a bench for the block couldn’t hurt, but I haven’t checked how representative it is

    Let’s find out!


    l0rinc commented at 10:03 pm on October 26, 2024:
    I’ve changed the usages to avoid std::vector keys; this way GCC and Clang both agree that the new results are faster (even though Clang manages to compile to 32-byte SIMD, while GCC only to 16 bytes per iteration, see #31144 (review))
  13. DrahtBot removed the label CI failed on Oct 24, 2024
  14. l0rinc force-pushed on Oct 24, 2024
  15. l0rinc force-pushed on Oct 24, 2024
  16. l0rinc renamed this:
    optimization: pack util::Xor into 64 bit chunks instead of doing it byte-by-byte
    optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte
    on Oct 24, 2024
  17. l0rinc marked this as ready for review on Oct 24, 2024
  18. laanwj added the label Block storage on Oct 24, 2024
  19. maflcko commented at 7:14 pm on October 24, 2024: member

    It would be good to explain the jpegs in the description, or even remove them. They will be excluded from the merge commit and aren’t shown, unless GitHub happens to be reachable and online. Are they saying that IBD was 4% faster? Also, I think they were created with the UB version of this pull, so may be outdated either way?

    I did a quick check on my laptop and it seems the XorSmall (1+4 bytes) is slower with this pull. The Xor (modified to check 40 bytes) was twice as fast. Overall, I’d expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

    I can try to bench on another machine later, to see if it makes a difference.

    Can you clarify what type of machine you tested this on?

  20. l0rinc commented at 8:39 pm on October 24, 2024: contributor

    Are they saying that IBD was 4% faster?

    That’s what I’m measuring currently, but I don’t expect more than 2% difference here.

    Also, I think they were created with the UB version of this pull, so may be outdated either way?

    Benchmarks indicated that the 64 bit compiled result was basically the same.

    Overall, I’d expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

    I’ll investigate, thanks.

    Posting the perf numbers here for reference: reindexing until 300k blocks reveals that XOR usage was reduced.

  21. in src/test/streams_tests.cpp:338 in a3dc138798 outdated
    296@@ -270,7 +297,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file)
    297         BOOST_CHECK(false);
    298     } catch (const std::exception& e) {
    299         BOOST_CHECK(strstr(e.what(),
    300-                        "Rewind limit must be less than buffer size") != nullptr);
    301+                           "Rewind limit must be less than buffer size") != nullptr);
    


    hodlinator commented at 12:41 pm on October 25, 2024:
    Seems like the prior author just rounded off to some not-too-unreasonable tab-indentation (efd2474d17098c754367b844ec646ebececc7c74). The function isn’t touched in this PR, so we should probably resist touching it here and below.

    l0rinc commented at 2:18 pm on October 25, 2024:
    Seems it was a deliberate formatting, so I’ll revert. Will push after I have the block serialization benchmark working.

    l0rinc commented at 9:58 pm on October 26, 2024:
    Ended up modifying a lot more in the latest push, so this line was formatted again
  22. in src/test/streams_tests.cpp:35 in a3dc138798 outdated
    30+        const size_t write_size = rng.randrange(100);
    31+        const size_t key_offset = rng.randrange(key_size + 2);
    32+
    33+        std::vector key(rng.randbytes<std::byte>(key_size));
    34+        std::vector expected(rng.randbytes<std::byte>(write_size));
    35+        std::vector actual(expected);
    


    hodlinator commented at 12:46 pm on October 25, 2024:

    Experimented with changing to brace-initialization, which uncovered some slight narrowing/widening. (Thought I had an angle for making the code more robust in a more material way, but that attempt failed.)

    auto expected_xor{[](Span<std::byte> write, Span<const std::byte> key, size_t key_offset) {
        if (key.size()) {
            for (auto& b : write) {
                b ^= key[key_offset++ % key.size()];
            }
        }
    }};

    FastRandomContext rng{false};
    for (int test = 0; test < 100; ++test) {
        const int key_size{rng.randrange(10)};
        const int write_size{rng.randrange(100)};
        const int key_offset{rng.randrange(key_size + 2)};

        std::vector key{rng.randbytes<std::byte>(key_size)};
        std::vector expected{rng.randbytes<std::byte>(write_size)};
        std::vector actual{expected};
    

    l0rinc commented at 2:18 pm on October 25, 2024:
    Seems a bit excessive for a test, but ended up changing it to e.g. const size_t write_size{rng.randrange(100U)};. Will push a bit later.
  23. in src/test/streams_tests.cpp:19 in a3dc138798 outdated
    13@@ -14,6 +14,33 @@ using namespace std::string_literals;
    14 
    15 BOOST_FIXTURE_TEST_SUITE(streams_tests, BasicTestingSetup)
    16 
    17+BOOST_AUTO_TEST_CASE(xor_bytes)
    18+{
    19+    auto expected_xor = [](Span<std::byte> write, Span<const std::byte> key, size_t key_offset) {
    


    hodlinator commented at 1:12 pm on October 25, 2024:
    Might as well use std::span for new code.

    l0rinc commented at 2:18 pm on October 25, 2024:
    Indeed!
  24. hodlinator commented at 2:02 pm on October 25, 2024: contributor

    Concept ACK a3dc138798e2f2c7aa1c9b37633c16c1b523a251

    Operating on CPU words rather than individual bytes. :+1:

    Not entirely clear to me from #31144 (review) whether the optimizer is able to use SIMD. Guess picking through the binary of a GUIX-build would give a definitive answer.

    The verbosity of std::memcpy hurts readability but alignment issues are real.

  25. l0rinc marked this as a draft on Oct 25, 2024
  26. l0rinc force-pushed on Oct 26, 2024
  27. l0rinc force-pushed on Oct 26, 2024
  28. l0rinc force-pushed on Oct 26, 2024
  29. DrahtBot commented at 9:58 pm on October 26, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32109998785

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  30. DrahtBot added the label CI failed on Oct 26, 2024
  31. l0rinc force-pushed on Oct 26, 2024
  32. l0rinc renamed this:
    optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte
    optimization: change XOR obfuscation key from `std::vector<std::byte>(8)` to `uint64_t`
    on Oct 26, 2024
  33. l0rinc force-pushed on Oct 27, 2024
  34. DrahtBot removed the label CI failed on Oct 27, 2024
  35. l0rinc force-pushed on Oct 27, 2024
  36. l0rinc force-pushed on Oct 27, 2024
  37. l0rinc marked this as ready for review on Oct 27, 2024
  38. l0rinc force-pushed on Oct 29, 2024
  39. l0rinc force-pushed on Oct 29, 2024
  40. DrahtBot added the label CI failed on Oct 29, 2024
  41. DrahtBot commented at 12:49 pm on October 29, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32217592364

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  42. DrahtBot removed the label CI failed on Oct 29, 2024
  43. hodlinator commented at 12:40 pm on October 31, 2024: contributor

    The CI failure in https://github.com/bitcoin/bitcoin/runs/32217592364 might come from a bad rebase reconciliation with master?

    [12:44:30.723] Duplicate include(s) in src/bench/xor.cpp:
    [12:44:30.723] #include <cstddef>
    [12:44:30.723] #include <span.h>
    [12:44:30.723] #include <streams.h>
    [12:44:30.723] #include <random.h>
    [12:44:30.723] #include <vector>
    [12:44:30.723] #include <bench/bench.h>
    [12:44:30.723]
    [12:44:30.728] ^---- ⚠️ Failure generated from lint-includes.py
    
  44. l0rinc commented at 12:59 pm on October 31, 2024: contributor

    The CI failure in bitcoin/bitcoin/runs/32217592364 might come from a bad rebase reconciliation with master?

    That’s an old build, right? Latest Lint seems fine

  45. in src/dbwrapper.cpp:259 in 8696184c2f outdated
    263-        // Initialize non-degenerate obfuscation if it won't upset
    264-        // existing, non-obfuscated data.
    265-        std::vector<unsigned char> new_key = CreateObfuscateKey();
    266+    obfuscate_key = 0; // needed for unobfuscated Read
    267+    std::vector<unsigned char> obfuscate_key_vector(OBFUSCATE_KEY_NUM_BYTES, '\000');
    268+    if (bool key_exists = Read(OBFUSCATE_KEY_KEY, obfuscate_key_vector); !key_exists && params.obfuscate && IsEmpty()) {
    


    hodlinator commented at 1:09 pm on October 31, 2024:

    While key_exists is only created for this `if`-block, there are other conditions involved, and we test for the negation of the value, so I find it less surprising to revert to the previous approach.

    bool key_exists = Read(OBFUSCATE_KEY_KEY, obfuscate_key_vector);
    if (!key_exists && params.obfuscate && IsEmpty()) {
    

    l0rinc commented at 2:57 pm on October 31, 2024:
    This pattern has been used before, I meant to narrow the scope of the var here. If you feel strongly about it, I’ll separate them.

    hodlinator commented at 3:09 pm on October 31, 2024:

    This pattern has been used before

    In the example you give, the variable is used in a more “positive” sense. Here we test the negated value, which is part of what makes it jarring.

    One could maybe improve readability by moving the negation and renaming:

    if (bool missing_key = !Read(OBFUSCATE_KEY_KEY, obfuscate_key_vector); missing_key && params.obfuscate && IsEmpty()) {
    

    I meant to narrow the scope of the var here.

    That’s why I was referring to the if-block.

    If you feel strongly about it, I’ll separate them.

    Yes please, either that or my negation move+rename suggestion.


    l0rinc commented at 3:13 pm on October 31, 2024:
    Done both, thanks, good observation about the negation

    hodlinator commented at 3:48 pm on October 31, 2024:

    My point about moving the negation and changing the name made more sense in the context of keeping it inside the if-block. If you are open to moving it out, I’d say it’s better to keep the original key_exists name and original negation to avoid the churn and make it easier to review.

    (Realized another reason for not having it inside the if-block is that we are mutating obfuscate_key_vector, which is used after the block).


    l0rinc commented at 5:43 pm on October 31, 2024:
    I prefer the new key_missing part, it did confuse me once during development as well
  46. in src/node/blockstorage.cpp:1180 in 8696184c2f outdated
    1173@@ -1174,7 +1174,11 @@ static auto InitBlocksdirXorKey(const BlockManager::Options& opts)
    1174         };
    1175     }
    1176     LogInfo("Using obfuscation key for blocksdir *.dat files (%s): '%s'\n", fs::PathToString(opts.blocks_dir), HexStr(xor_key));
    1177-    return std::vector<std::byte>{xor_key.begin(), xor_key.end()};
    1178+    assert(xor_key.size() == 8);
    1179+    uint64_t key;
    1180+    std::memcpy(&key, xor_key.data(), 8);
    1181+    xor_key.fill(std::byte{0});
    


    hodlinator commented at 1:16 pm on October 31, 2024:

    I find it more useful to (static) assert that the std::array size matches the uint64 size directly. Also don’t see a point in zeroing out the local variable before returning?

    0    uint64_t key;
    1    static_assert(xor_key.size() == sizeof(key));
    2    std::memcpy(&key, xor_key.data(), sizeof(key));
    

    l0rinc commented at 2:59 pm on October 31, 2024:
    I’ve added the fills to make sure we’re not using them after conversion anymore. What would be the advantage of the static asserts? I don’t mind removing these failsafes if you think they’re redundant or noisy.

    l0rinc commented at 3:15 pm on October 31, 2024:
    Removed in the end, not that important
  47. in src/bench/xor.cpp:8 in 8696184c2f outdated
    2@@ -3,22 +3,1283 @@
    3 // file COPYING or https://opensource.org/license/mit/.
    4 
    5 #include <bench/bench.h>
    6+#include <cmath>
    7+#include <cstddef>
    8+#include <map>
    


    fanquake commented at 1:16 pm on October 31, 2024:
    You can keep the std lib headers separated from our own headers.

    l0rinc commented at 1:55 pm on October 31, 2024:
    Done, anything else?
  48. maflcko commented at 1:18 pm on October 31, 2024: member

    Taking a step back, I wonder if this is worth it. IIRC it gives a +1% speedup when run on spinning storage, so it seems a higher speedup is possibly visible on faster, modern storage. However, it would be good if any IO delay was taken out of IBD completely, so that the speed of storage and the speed of XOR is largely irrelevant.

    I haven’t looked, but this may be possible by asking for the next block to be read into memory in the background, as soon as work on the current block begins.

    Something similar is done in net_processing, where the (presumed) current block is read into memory, and kept there, to avoid having to read it from storage again in validation. See https://github.com/bitcoin/bitcoin/blob/f07a533dfcb172321972e5afb3b38a4bd24edb87/src/net_processing.cpp#L823

  49. in src/node/blockstorage.cpp:1177 in 23fc898514 outdated
    1173@@ -1174,6 +1174,7 @@ static auto InitBlocksdirXorKey(const BlockManager::Options& opts)
    1174         };
    1175     }
    1176     LogInfo("Using obfuscation key for blocksdir *.dat files (%s): '%s'\n", fs::PathToString(opts.blocks_dir), HexStr(xor_key));
    1177+    assert(xor_key.size() == 8);
    


    hodlinator commented at 1:20 pm on October 31, 2024:
    Don’t see much point in adding the assert here in 23fc898514bf9696facbaff65251b62c362d214e where we still only have a fixed-size std::array with the ~asserted~ fixed size of 8. Seems sufficient with the assert in the BlockManager-ctor.

    l0rinc commented at 3:13 pm on October 31, 2024:
    Simplified, thanks
  50. l0rinc commented at 1:24 pm on October 31, 2024: contributor

    Taking a step back, I wonder if this is worth it. IIRC it gives a +1% speedup

    That’s not the main point; rather, we’re now storing the key in a value that can be short-circuited easily, so that when the key is 0 (i.e. xor is a NOOP) we can skip it entirely. Previously this would have only been possible by checking each byte of the key. It’s also a lot cleaner to store it in a primitive instead, which supports xor natively. Xor comes up in every profiling I do; we shouldn’t have a regression because of #28207 - this PR solves that.
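
    The short-circuit described here can be illustrated with a minimal standalone sketch (C++20; `XorSketch` is a hypothetical stand-in, not the PR’s batched implementation - the per-byte loop only models the semantics):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <span>

    // Minimal model of the zero-key short-circuit: with the key in a
    // uint64_t, "obfuscation disabled" is a single integer comparison,
    // while a std::vector<std::byte> key would need a byte-by-byte check.
    inline void XorSketch(std::span<std::byte> write, uint64_t key)
    {
        if (key == 0) return; // XOR with an all-zero key is a no-op
        for (size_t i{0}; i < write.size(); ++i) {
            write[i] ^= std::byte{static_cast<uint8_t>(key >> (8 * (i % 8)))};
        }
    }
    ```

    With the key held as an integer, disabling obfuscation costs a single comparison instead of scanning a key vector byte by byte.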

  51. in src/test/streams_tests.cpp:262 in 850214ffd9 outdated
    261@@ -235,7 +262,7 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
    262     // Single character key
    


    hodlinator commented at 1:29 pm on October 31, 2024:
    In 850214ffd9f56e887a18d0428d5881e6c1ee8652: Single/Multi character key comments don’t make sense inside of this commit.

    l0rinc commented at 3:13 pm on October 31, 2024:
    Removed, thanks

    hodlinator commented at 3:39 pm on October 31, 2024:
    (Should be done in the initial commit which invalidates the comments IMO).
  52. maflcko commented at 1:39 pm on October 31, 2024: member

    we shouldn’t have a regression because of #28207 - this PR solves that.

    It is not possible to do XOR without any cost at all. There will always be an overhead and I think calling #28207 a regression and this change a “fix” is not entirely accurate. This change reduces the overhead, according to the benchmarks.

    That’s not the main point

    The pull request title starts with “optimization”, so I got the impression that a speedup is the main point.

    The reason that std::vector was picked is that switching to larger sizes is possible. However, if there was a need to do that, XOR would likely not be sufficient anyway. So limiting to 8 bytes fixed at compile time seems reasonable.

    I am mostly saying that any speedup here may not be visible at all if IO is completely taken out of the critical path, but I haven’t looked into that in detail.

  53. l0rinc commented at 1:52 pm on October 31, 2024: contributor

    change a “fix” is not entirely accurate

    When xor is disabled we’re now not xor-ing at all. Previously we still performed the xor, so this change brings back the original behavior when xor is not needed.

    The pull request title starts with “optimization”, so I got the impression that a speedup is the main point.

    Yes, please see the updated description with the benchmarks: #31144#issue-2610689777

    may not be visible at all if IO is completely taken out of the critical path

    I’d argue the current implementation is slightly simpler (i.e. xor is stored and performed natively and can be disabled) and faster (2x for a representative dataset).

  54. l0rinc force-pushed on Oct 31, 2024
  55. in src/streams.cpp:14 in 4437d4379c outdated
    10@@ -11,6 +11,7 @@
    11 AutoFile::AutoFile(std::FILE* file, std::vector<std::byte> data_xor)
    12     : m_file{file}, m_xor{std::move(data_xor)}
    13 {
    14+    assert(m_xor.size() == 8);
    


    hodlinator commented at 2:15 pm on October 31, 2024:
    Guess the point of adding this assert is to prove that it doesn’t break anything before switching to uint64?

    l0rinc commented at 3:01 pm on October 31, 2024:
    yes
  56. in src/streams.h:408 in 4437d4379c outdated
    404@@ -393,7 +405,7 @@ class AutoFile
    405     std::optional<int64_t> m_position;
    406 
    407 public:
    408-    explicit AutoFile(std::FILE* file, std::vector<std::byte> data_xor={});
    409+    explicit AutoFile(std::FILE* file, std::vector<std::byte> data_xor = {std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}});
    


    hodlinator commented at 2:27 pm on October 31, 2024:
    In 4437d4379c42a7b87bd01ad5ea6c450a732f4f95: Less verbose: {std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}, std::byte{0x00}} => {8, std::byte{0x00}}

    l0rinc commented at 3:01 pm on October 31, 2024:
    it’s removed in the next commit

    l0rinc commented at 3:13 pm on October 31, 2024:
    But done anyway
  57. in src/node/mempool_persist.cpp:189 in 33625e9d53 outdated
    188-            file.SetXor(xor_key);
    189+            std::vector<std::byte> xor_key_vector(8);
    190+            FastRandomContext{}.fillrand(xor_key_vector);
    191+            file << xor_key_vector;
    192+
    193+            uint64_t m_xor;
    


    hodlinator commented at 2:46 pm on October 31, 2024:
    Should not use m_-prefix for non-member variables.

    l0rinc commented at 3:13 pm on October 31, 2024:
    You’re right, I’ve used xor_key in the same case before
  58. l0rinc force-pushed on Oct 31, 2024
  59. hodlinator commented at 4:00 pm on October 31, 2024: contributor

    (Reviewed a5cad729c76cafa047a2b1897595669ae9b2b0d5)

    Since my previous Concept ACK, the PR was changed to switch the xor key more completely to uint64_t. Before the PR, we were already using fixed-size of 8 bytes for the obfuscation value in the file formats, so changing the type to uint64_t shouldn’t be noticeable to users. :+1:

    Even if we could move reading and XOR-ing out of the hot path as suggested by maflcko, we might as well make use of the CPU architectures we have. I would expect larger-sized XOR operations to have less overhead and energy waste (less heat).

  60. maflcko commented at 5:32 pm on October 31, 2024: member

    We could have used ReadLE64 to unify the byte order for keys and writable values, but that shouldn’t be necessary, since both have the same endianness locally, which isn’t affected by a byte-by-byte xor.

    The s390x unit tests fail:

     0./src/test/streams_tests.cpp(40): error: in "streams_tests/xor_bytes": check { expected.begin(), expected.end() } == { actual.begin(), actual.end() } has failed. 
     1Mismatch at position 0: � != |
     2Mismatch at position 1: Y != �
     3Mismatch at position 2: � != �
     4Mismatch at position 3: � != �
     5Mismatch at position 4: w != �
     6Mismatch at position 5: C != �
     7Mismatch at position 6:  != x
     8Mismatch at position 7: � != �
     9Mismatch at position 8: � != C
    10Mismatch at position 9: Y != �
    11Mismatch at position 10: � != �
    12Mismatch at position 11: , != �
    13Mismatch at position 12: � != R
    14Mismatch at position 13: 8 != �
    15Mismatch at position 14: � != �
    16Mismatch at position 15: � != �
    17Mismatch at position 16: � != l
    18Mismatch at position 17: � != n
    19Mismatch at position 18: � != �
    20Mismatch at position 19: t != B
    21Mismatch at position 20: ; != �
    22Mismatch at position 21: != �
    23Mismatch at position 22: � != �
    24Mismatch at position 23: � != �
    25Mismatch at position 24: k != �
    26Mismatch at position 25: � != Z
    27Mismatch at position 26: � != �
    28Mismatch at position 27: � != �
    29Mismatch at position 28: � != #
    30Mismatch at position 29: 8 != �
    31Mismatch at position 30: � != �
    32Mismatch at position 31: � != �
    33Mismatch at position 32: g != �
    34Mismatch at position 33: � != ^
    35Mismatch at position 34: � != �
    36Mismatch at position 36: � != k
    37Mismatch at position 37: * != �
    38Mismatch at position 38: q != 
    39Mismatch at position 39: � != �
    40Mismatch at position 40: � != �
    41Mismatch at position 41: r != e
    42Mismatch at position 42: � != �
    
  61. l0rinc commented at 5:42 pm on October 31, 2024: contributor

    The s390x unit tests fail:

    I don’t know how to access that, is it part of CI? Does the test suite pass on it otherwise or was it just curiosity? Do you know the reason it fails, is my assumption incorrect that endianness should apply to both parts (key and value) the same way? Is the test wrong or the xor, should I change the test such that xor-ing twice should reveal the original data instead (while the intermediary part should not)?

  62. maflcko commented at 6:37 pm on October 31, 2024: member

    I don’t know how to access that, is it part of CI?

    It needs to be run manually. See https://github.com/bitcoin/bitcoin/tree/master/ci#running-a-stage-locally. (podman run --rm --privileged docker.io/multiarch/qemu-user-static --reset -p yes may be required to setup qemu-s390x, depending on your setup). Then something like MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh should run it.

    Does the test suite pass on it otherwise or was it just curiosity?

    Yes, it should pass on s390x. If not, that is a bug somewhere.

  63. l0rinc marked this as a draft on Oct 31, 2024
  64. l0rinc force-pushed on Nov 2, 2024
  65. l0rinc commented at 7:45 pm on November 2, 2024: contributor

    Thanks for the hints @maflcko, I was under the impression that big-endian tests were run automatically.

    Fix

    It seems that std::rotr doesn’t take endianness into account, so the fix basically looks like:

    0int key_rotation = 8 * static_cast<int>(key_offset);
    1if constexpr (std::endian::native == std::endian::big) key_rotation *= -1;
    2return std::rotr(key, key_rotation);
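
    The property this restores can be checked in isolation. A hedged sketch (C++20; `RotateKey` is a hypothetical helper mirroring the fix above, not the PR’s exact code): after consuming `offset` key bytes, byte `i` of the rotated key in memory order must equal byte `(i + offset) % 8` of the original key, regardless of endianness.

    ```cpp
    #include <bit>
    #include <cassert>
    #include <cstdint>
    #include <cstring>

    // Hypothetical helper mirroring the fix above: rotate the integer key
    // so that its *memory* byte order is advanced by `offset` bytes. On
    // little-endian this is a right rotation, on big-endian a left one.
    inline uint64_t RotateKey(uint64_t key, size_t offset)
    {
        int key_rotation = 8 * static_cast<int>(offset);
        if constexpr (std::endian::native == std::endian::big) key_rotation *= -1;
        return std::rotr(key, key_rotation);
    }
    ```

    Note that std::rotr accepts a negative count (rotating left), which is what makes the big-endian branch work.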
    

    I’ve emulated the s390x behavior locally like this:

    0brew install podman pigz
    1softwareupdate --install-rosetta
    2podman machine init
    3podman machine start
    4docker run --platform linux/s390x -it ubuntu:22.04 /bin/bash
    5    apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev && \
    6    cd /mnt && git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
    7    cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
    8    ./build/src/test/test_bitcoin --run_test=streams_tests
    

    Changes

    • The change also includes an updated benchmarking suite, with 700k blocks inspected for all usages of util::Xor to make the benchmark representative:

    • Changed AutoFileXor benchmark to measure the regression of turning off obfuscation.

    • I’ve also updated all key derivations to vector-to-uint64 instead of generating the 64 bit key directly (it’s more consistent, more representative and helps with endianness).

    • Added an xor roundtrip test which applies the xor in random chunks, asserts that the result differs from the original, then reapplies it in different random chunks and asserts that the result is equivalent to the original.

    See the complete diff: https://github.com/bitcoin/bitcoin/compare/a5cad729c76cafa047a2b1897595669ae9b2b0d5..af778c5bdea4175f84e82b41f3b51c7f453d8c7e

  66. l0rinc marked this as ready for review on Nov 2, 2024
  67. l0rinc force-pushed on Nov 5, 2024
  68. l0rinc commented at 4:40 pm on November 5, 2024: contributor

    Updated the benchmark to 860k blocks (September 2024):

    This one contains a lot of very big arrays (96'233 separate sizes, biggest was 3'992'470 bytes long) - a big departure from the previous 400k and 700k blocks (having 1500 sizes, biggest was 9319 bytes long).

    The performance characteristics are also quite different, now that we have more and bigger byte arrays:

    C++ compiler …………………….. AppleClang 16.0.0.16000026

    Before:

    | ns/byte | byte/s | err% | total | benchmark |
    |--------:|-------:|-----:|------:|:----------|
    | 1.29 | 774,577,944.12 | 0.2% | 115.99 | `XorHistogram` |

    After:

    | ns/byte | byte/s | err% | total | benchmark |
    |--------:|-------:|-----:|------:|:----------|
    | 0.04 | 26,411,646,837.32 | 0.2% | 8.97 | `XorHistogram` |

    i.e. ~35x faster with Clang at processing the data with representative histograms.

    C++ compiler …………………….. GNU 13.2.0

    Before:

    | ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
    |--------:|-------:|-----:|---------:|---------:|----:|---------:|------:|------:|:----------|
    | 0.97 | 1,032,916,679.87 | 0.0% | 9.01 | 3.29 | 2.738 | 1.00 | 0.0% | 86.58 | `XorHistogram` |

    After:

    | ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
    |--------:|-------:|-----:|---------:|---------:|----:|---------:|------:|------:|:----------|
    | 0.10 | 10,369,097,976.62 | 0.0% | 0.32 | 0.33 | 0.985 | 0.06 | 0.6% | 8.63 | `XorHistogram` |

    i.e. ~10x faster with GCC at processing the data with representative histograms.

    Edit: note that I couldn’t use random byte generation for each benchmark value since it timed out on CI. I have replaced it with getting subsets of a single big random vector.

  69. l0rinc force-pushed on Nov 5, 2024
  70. DrahtBot added the label CI failed on Nov 5, 2024
  71. DrahtBot commented at 4:49 pm on November 5, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32549559124

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  72. l0rinc force-pushed on Nov 6, 2024
  73. l0rinc force-pushed on Nov 6, 2024
  74. l0rinc force-pushed on Nov 7, 2024
  75. l0rinc force-pushed on Nov 8, 2024
  76. DrahtBot removed the label CI failed on Nov 8, 2024
  77. l0rinc commented at 10:59 am on November 11, 2024: contributor

    I ran a reindex-chainstate until 860k blocks (SSD, i7 CPU), before and after this change, 2 runs per commit. As stated in the previous comment, the latter blocks (>700k) seem to contain a lot of very big vectors where the new algorithm shines.

    0hyperfine   \
    1--runs 2   \
    2--export-json /mnt/my_storage/reindex-chainstate-obfuscation-overhead.json  \
    3--parameter-list COMMIT 866f4fa521f6932162570d6531055cc007e3d0cd,3efc72ff7cbdfb788b23bf4346e29ba99362c120,08db1794647c37f966c525411f931a4a0f6b6119 \
    4--prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' \
    5'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate'
    

    Which indicates that this change speeds up IBD (or at least reindex-chainstate) by roughly ~3% (i.e. reverts the regression):

    • 866f4fa521f6932162570d6531055cc007e3d0cd - baseline
    • 3efc72ff7cbdfb788b23bf4346e29ba99362c120 - end-to-end vector to uint64
    • 08db1794647c37f966c525411f931a4a0f6b6119 - dummy commit that forces obfuscation keys to be 0 to ignore xor operations (since -blocksxor=0 doesn’t affect dbwrapper), to see how xor affects speed in general

    Benchmark 1: COMMIT=866f4fa521f6932162570d6531055cc007e3d0cd ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate

    Time (mean ± σ): 20846.396 s ± 72.227 s [User: 22353.773 s, System: 2563.864 s] Range (min … max): 20795.323 s … 20897.468 s 2 runs

    Benchmark 2: COMMIT=3efc72ff7cbdfb788b23bf4346e29ba99362c120 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate

    Time (mean ± σ): 20308.036 s ± 100.345 s [User: 21745.076 s, System: 2591.459 s] Range (min … max): 20237.081 s … 20378.991 s 2 runs

    Benchmark 3: COMMIT=08db1794647c37f966c525411f931a4a0f6b6119 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate

    Time (mean ± σ): 20293.227 s ± 9.923 s [User: 21711.799 s, System: 2588.886 s] Range (min … max): 20286.210 s … 20300.244 s 2 runs

    0Summary
    1'COMMIT=08db1794647c37f966c525411f931a4a0f6b6119 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate' ran
    2 1.00 ± 0.00 times faster than 'COMMIT=3efc72ff7cbdfb788b23bf4346e29ba99362c120 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate'
    3 1.03 ± 0.00 times faster than 'COMMIT=866f4fa521f6932162570d6531055cc007e3d0cd ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0 -reindex-chainstate'
    
  78. in src/test/streams_tests.cpp:301 in a123297318 outdated
    299+        uint64_t xor_key;
    300+        std::memcpy(&xor_key, xor_pat.data(), 8);
    301+
    302         DataStream ds{in};
    303-        ds.Xor({0xff});
    304+        ds.Xor(xor_key);
    


    hodlinator commented at 8:26 pm on November 11, 2024:

    Let’s not add bloat for this case.

    0        ds.Xor(0xffffffffffffffff);
    

    l0rinc commented at 9:59 pm on November 11, 2024:
    Please see: #31144 (review)

    hodlinator commented at 10:26 pm on November 11, 2024:
    All ones (binary) is not endian-sensitive. IMO it’s okay for the tests to look slightly different to reduce this kind of noise.

    l0rinc commented at 10:28 pm on November 11, 2024:
    I want to test the setup we have in production, these tests were very useful in revealing endianness problems.
  79. in src/test/streams_tests.cpp:315 in a123297318 outdated
    316+        uint64_t xor_key;
    317+        std::memcpy(&xor_key, xor_pat.data(), 8);
    318+
    319         DataStream ds{in};
    320-        ds.Xor({0xff, 0x0f});
    321+        ds.Xor(xor_key);
    


    hodlinator commented at 8:32 pm on November 11, 2024:

    Should work regardless of endianness.

    0        ds.Xor(0xff0fff0fff0fff0f);
    

    l0rinc commented at 9:56 pm on November 11, 2024:
    It doesn’t (at least not for all test cases). I’m deliberately generating vectors and converting them instead of generating 64-bit values directly, since that’s what happens in production code (this should address several of your comments).

    hodlinator commented at 10:45 pm on November 11, 2024:
  80. in src/test/streams_tests.cpp:82 in a123297318 outdated
    78     const std::vector<uint8_t> test1{1, 2, 3};
    79     const std::vector<uint8_t> test2{4, 5};
    80-    const std::vector<std::byte> xor_pat{std::byte{0xff}, std::byte{0x00}};
    81+    const std::vector xor_pat{std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}};
    82+    uint64_t xor_key;
    83+    std::memcpy(&xor_key, xor_pat.data(), 8);
    


    hodlinator commented at 8:35 pm on November 11, 2024:
    0    const uint64_t xor_pat{0xff00ff00ff00ff00};
    

    l0rinc commented at 9:56 pm on November 11, 2024:
    Please see: #31144 (review)

    hodlinator commented at 10:35 pm on November 11, 2024:
    Feel like I’m imitating ChatGPT: Ah, yes, I get now how the memcpy() interaction with endianness preserves the functioning of this specific test.
  81. in src/dbwrapper.cpp:257 in a123297318 outdated
    261-
    262-    if (!key_exists && params.obfuscate && IsEmpty()) {
    263-        // Initialize non-degenerate obfuscation if it won't upset
    264-        // existing, non-obfuscated data.
    265-        std::vector<unsigned char> new_key = CreateObfuscateKey();
    266+    obfuscate_key = 0; // needed for unobfuscated Read
    


    hodlinator commented at 9:19 pm on November 11, 2024:
    0    obfuscate_key = 0; // Needed for unobfuscated Read
    

    l0rinc commented at 11:07 pm on November 11, 2024:
    Done
  82. in src/node/mempool_persist.cpp:64 in a123297318 outdated
    62         if (version == MEMPOOL_DUMP_VERSION_NO_XOR_KEY) {
    63-            // Leave XOR-key empty
    64+            file.SetXor(0);
    65         } else if (version == MEMPOOL_DUMP_VERSION) {
    66-            file >> xor_key;
    67+            std::vector<std::byte> xor_key_vector(8);
    


    hodlinator commented at 9:27 pm on November 11, 2024:

    Here and on line 185:

    0            std::vector<std::byte> xor_key_vector{8};
    

    l0rinc commented at 11:07 pm on November 11, 2024:
    Now I had to rename the PR :D
  83. in src/streams.h:49 in a123297318 outdated
    59+    switch (write.size()) { // Specify the optimization categories
    60+    case 0: break;
    61+    case 1: XorInt(write, key, 1); break;
    62+    case 2: XorInt(write, key, 2); break;
    63+    case 4: XorInt(write, key, 4); break;
    64+    default: XorInt(write, key, write.size());
    


    hodlinator commented at 9:36 pm on November 11, 2024:
    Care to elaborate why we don’t just have the default statement?

    l0rinc commented at 9:59 pm on November 11, 2024:
    “Specify the optimization categories” means that the compiler can optimize each case where one of the parameters (the size) is a compile-time constant separately. The default statement alone would work, but would be very slow, since the 1, 2 and 4 byte versions wouldn’t be specialized.

    hodlinator commented at 10:42 pm on November 11, 2024:

    How about

    0    // Help optimizers along by sending constant parameter values into the inlined function,
    1    // resulting in more efficient substitutions of memcpy() -> native pow-2 copy instructions.
    2    switch (write.size()) {
    3    case 0: break;
    4    case 1: XorInt(write, key, 1); break;
    5    case 2: XorInt(write, key, 2); break;
    6    case 4: XorInt(write, key, 4); break;
    7    default: XorInt(write, key, write.size());
    

    l0rinc commented at 11:07 pm on November 11, 2024:
    I ended up with only // Help the compiler specialize 1, 2 and 4 byte cases, since the rest was just speculation
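
    A standalone sketch of what such a specialized helper could look like (hypothetical `XorInt`, modeled on the switch discussed above - not the PR’s exact code): when `size` is a compile-time constant at the call site, both `memcpy` calls collapse into single load/store instructions and the whole call becomes one integer XOR.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <span>

    // Hypothetical sketch of a size-specialized XOR helper: for the
    // constant sizes 1, 2 and 4 (and 8) the memcpy calls can be replaced
    // by single loads/stores, so each case compiles to one XOR.
    inline void XorInt(std::span<std::byte> write, uint64_t key, size_t size)
    {
        assert(size <= 8 && size <= write.size());
        uint64_t raw{0};
        std::memcpy(&raw, write.data(), size);
        raw ^= key;
        std::memcpy(write.data(), &raw, size);
    }
    ```

    The memcpy also sidesteps any alignment requirements on `write.data()`, which is why the real code takes the same approach.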
  84. hodlinator commented at 9:49 pm on November 11, 2024: contributor

    Reviewed a1232973189126cfc9526713011461709685fcc8

    git range-diff master a5cad72 a123297

    The awk script in the comment in c671dd17dced0d51845dc8d2148f288c4c44ecb2 doesn’t add the ' separators…?

    Care to explain the scaling_factor value?

    XorHistogram claims to use 8 GB of RAM. Could be a bit much if we want to be able to also run benchmarks on low-end devices.

     0diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
     1index 2ba8b17f08..bb193c7fcf 100644
     2--- a/src/bench/xor.cpp
     3+++ b/src/bench/xor.cpp
     4@@ -96269,18 +96269,13 @@ static void XorHistogram(benchmark::Bench& bench)
     5 
     6         total_bytes += scaled_count * size;
     7         for (size_t j{0}; j < scaled_count; ++j) {
     8-            std::vector<std::byte> ret;
     9-            ret.reserve(size);
    10-            ret.insert(ret.end(), pattern.begin(), pattern.begin() + size);
    11-            test_data.push_back(std::move(ret));
    12+            test_data.emplace_back(pattern.begin(), pattern.begin() + size);
    13         }
    14     }
    15     assert(total_bytes == 8'129'394'848); // ~8 GB of data
    16     std::shuffle(test_data.begin(), test_data.end(), rng); // Make it more realistic & less predictable
    17 
    18-    std::vector key_bytes{rng.randbytes<std::byte>(8)};
    19-    uint64_t key;
    20-    std::memcpy(&key, key_bytes.data(), 8);
    21+    const uint64_t key{rng.rand64()};
    22 
    23     size_t offset{0};
    24     bench.batch(total_bytes).unit("byte").run([&] {
    25@@ -96296,9 +96291,7 @@ static void AutoFileXor(benchmark::Bench& bench)
    26     FastRandomContext rng{/*fDeterministic=*/true};
    27     auto data{rng.randbytes<std::byte>(1'000'000)};
    28 
    29-    std::vector<std::byte> empty_key_bytes(8, std::byte{0}); // Test disabled xor
    30-    uint64_t empty_key;
    31-    std::memcpy(&empty_key, empty_key_bytes.data(), 8);
    32+    uint64_t empty_key{0};
    33 
    34     const fs::path test_path = fs::temp_directory_path() / "xor_benchmark.dat";
    35     AutoFile f{fsbridge::fopen(test_path, "wb+"), empty_key};
    
  85. l0rinc commented at 10:11 pm on November 11, 2024: contributor

    The awk script in the comment in https://github.com/bitcoin/bitcoin/commit/c671dd17dced0d51845dc8d2148f288c4c44ecb2 doesn’t add the ’ separators…?

    I’ve added them manually.

    Care to explain the scaling_factor value?

    ~The values in the histogram (i.e. total bytes streamed through xor) add up to 92gb, but 1 byte values occupy half of that.~ (edit: this is only true for the first row, in other cases we would need to multiply by the first column)

    Since we have thousands of big values, each representing a vector of that size, we have to include all of those in the test set at least once. I had to scale down the histogram such that the lower values, having only a few hundred occurrences, aren’t flattened out completely - to keep the histogram representative. Do you have a better idea?

    XorHistogram claims to use 8 GB of RAM. Could be a bit much if we want to be able to also run benchmarks on low-end devices.

    Yes, but if I scale it down more, more values will be equal in the histogram and it won’t reflect real usage. That’s why I’ve set it to low priority, we don’t have to run these for every execution.

    Edit: pushed some nits to git range-diff 866f4fa521f6932162570d6531055cc007e3d0cd..a1232973189126cfc9526713011461709685fcc8 866f4fa521f6932162570d6531055cc007e3d0cd..57caa965b5ae284e501f892415d60fcb536f4c0e

  86. l0rinc force-pushed on Nov 11, 2024
  87. l0rinc renamed this:
    optimization: change XOR obfuscation key from `std::vector<std::byte>(8)` to `uint64_t`
    optimization: change XOR obfuscation key from `std::vector<std::byte>{8}` to `uint64_t`
    on Nov 11, 2024
  88. hodlinator commented at 11:43 pm on November 11, 2024: contributor

    Care to explain the scaling_factor value?

    The values in the histogram (i.e. total bytes streamed through xor) add up to ~92gb, but 1 byte values occupy half of that. I had to scale down the histogram such that the lower values, having a few hundred occurrences aren’t flattened out completely - to make the histogram still representative. Do you have a better idea?

    So raw_histogram really is the raw data?

    Added:

    0    uint64_t raw_total_bytes{0};
    1    for (size_t i{0}; i < std::size(raw_histogram); i += 2) {
    2        const uint64_t size = raw_histogram[i];
    3        const uint64_t count = raw_histogram[i + 1];
    4        raw_total_bytes += size * count;
    5    }
    6    printf("raw_total_bytes: %ld", raw_total_bytes);
    

    Got:

    0raw_total_bytes: 1'277'637'746'542 # "'" added manually
    

    So something like 1.28TB instead of 92GB, but I can’t seem to get my head screwed on right today.


    If I re-understood correctly what the code was doing with the scaling - it’s doing only 1'000'000 XOR-passes for the most common size (1 byte) instead of 47'584'838'861, and scaling down the number of XOR-passes for the others by the same factor.

    Thought the following would be slightly faster:

     0diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
     1index 2ba8b17f08..5e933d74ea 100644
     2--- a/src/bench/xor.cpp
     3+++ b/src/bench/xor.cpp
     4@@ -96253,7 +96253,7 @@ static void XorHistogram(benchmark::Bench& bench)
     5     };
     6 
     7     const auto max_count{static_cast<double>(raw_histogram[1])}; // 1 byte is the most common
     8-    const auto scaling_factor{1'000'000U};
     9+    const auto scaling_factor{1'000'000.0 / max_count};
    10 
    11     FastRandomContext rng{/*fDeterministic=*/true};
    12     const size_t pattern_size = raw_histogram[std::size(raw_histogram) - 2];
    13@@ -96265,7 +96265,7 @@ static void XorHistogram(benchmark::Bench& bench)
    14     for (size_t i{0}; i < std::size(raw_histogram); i += 2) {
    15         const uint64_t size = raw_histogram[i];
    16         const uint64_t count = raw_histogram[i + 1];
    17-        const size_t scaled_count{static_cast<size_t>(std::ceil((static_cast<double>(count) / max_count) * scaling_factor))};
    18+        const size_t scaled_count{static_cast<size_t>(std::ceil(static_cast<double>(count) * scaling_factor))};
    19 
    20         total_bytes += scaled_count * size;
    21         for (size_t j{0}; j < scaled_count; ++j) {
    

    Still passes the total_bytes == 8'129'394'848-assert unmodified, but seems slightly slower in my measurements (including for posterity).
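
    The scaling rule under discussion reduces to a single expression; a sketch (names follow the diff above, the concrete counts from the thread are assumptions):

    ```cpp
    #include <cassert>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    // Sketch of the histogram scaling: every size bucket is shrunk by the
    // same factor so the most common size (1 byte, `max_count` occurrences)
    // contributes exactly `scaling_factor` XOR passes, while ceil()
    // guarantees that even the rarest sizes keep at least one pass.
    inline size_t ScaledCount(uint64_t count, double max_count, double scaling_factor)
    {
        return static_cast<size_t>(std::ceil((static_cast<double>(count) / max_count) * scaling_factor));
    }
    ```

    Pre-multiplying `scaling_factor / max_count` into one constant (as in the diff) is algebraically the same, which is why the assert on `total_bytes` still passes either way.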

  89. hodlinator commented at 9:21 am on November 12, 2024: contributor

    Reviewed 57caa965b5ae284e501f892415d60fcb536f4c0e

    git range-diff master a123297 57caa96

    • Applied most of my nits that were still valid.

    I understand you were burned by endianness but I disagree that it’s worth sacrificing readability where endianness is a non-issue.

    For cases where endianness is a concern, I suggest the following pattern:

    uint64_t xor_key;
    constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00})};
    std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);
    
    1. Declare the uint64_t first so sizeof can be used in place of magic 8.
    2. Use constexpr std::array instead of const std::vector to constrain the size.
    3. Use to_array initialization since regular brace-initialization will only complain if too many elements are given, not too few.
    4. Use uint8_t in favor of std::byte to improve readability.
    5. Use sizeof <target> for the memcpy call since it is good practice to avoid writing out of bounds and also avoids the magic 8.

    Use ~0ULL to express a uint64_t with all bits set, ensuring one doesn’t forget an f.
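    Both points can be sketched in a self-contained example (the byte pattern is the one from the test above; the roundtrip check is only illustrative):

    ```cpp
    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    int main()
    {
        uint64_t xor_key;
        constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00})};
        std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);

        // The bytes land in memory in pattern order regardless of host endianness
        // (so the *numeric* value of xor_key differs between LE and BE hosts) ...
        uint8_t bytes[sizeof xor_key];
        std::memcpy(bytes, &xor_key, sizeof xor_key);
        for (size_t i{0}; i < sizeof xor_key; ++i) assert(bytes[i] == xor_pat[i]);

        // ... while ~0ULL has every bit set on any platform, with no f to forget.
        assert(~0ULL == 0xffff'ffff'ffff'ffffULL);
        return 0;
    }
    ```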


    diff --git a/src/bench/xor.cpp b/src/bench/xor.cpp
    index f94a1a6f96..fdf41b3696 100644
    --- a/src/bench/xor.cpp
    +++ b/src/bench/xor.cpp
    @@ -96275,10 +96275,7 @@ static void XorHistogram(benchmark::Bench& bench)
         assert(total_bytes == 8'129'394'848); // ~8 GB of data
         std::shuffle(test_data.begin(), test_data.end(), rng); // Make it more realistic & less predictable
    
    -    std::vector key_bytes{rng.randbytes<std::byte>(8)};
    -    uint64_t key;
    -    std::memcpy(&key, key_bytes.data(), 8);
    -
    +    const uint64_t key{rng.rand64()};
         size_t offset{0};
         bench.batch(total_bytes).unit("byte").run([&] {
             for (auto& data : test_data) {
    @@ -96293,10 +96290,7 @@ static void AutoFileXor(benchmark::Bench& bench)
         FastRandomContext rng{/*fDeterministic=*/true};
         auto data{rng.randbytes<std::byte>(1'000'000)};
    
    -    std::vector<std::byte> empty_key_bytes(8, std::byte{0}); // Test disabled xor
    -    uint64_t empty_key;
    -    std::memcpy(&empty_key, empty_key_bytes.data(), 8);
    -
    +    const uint64_t empty_key{0}; // Test disabled xor
         const fs::path test_path = fs::temp_directory_path() / "xor_benchmark.dat";
         AutoFile f{fsbridge::fopen(test_path, "wb+"), empty_key};
         bench.batch(data.size()).unit("byte").run([&] {
    diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
    index 5e9b607b3a..ac51eb3e51 100644
    --- a/src/test/streams_tests.cpp
    +++ b/src/test/streams_tests.cpp
    @@ -77,9 +77,9 @@ BOOST_AUTO_TEST_CASE(xor_file)
         auto raw_file{[&](const auto& mode) { return fsbridge::fopen(xor_path, mode); }};
         const std::vector<uint8_t> test1{1, 2, 3};
         const std::vector<uint8_t> test2{4, 5};
    -    const std::vector xor_pat{std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}, std::byte{0xff}, std::byte{0x00}};
         uint64_t xor_key;
    -    std::memcpy(&xor_key, xor_pat.data(), 8);
    +    constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00})};
    +    std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);
    
         {
             // Check errors for missing file
    @@ -293,12 +293,8 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
         in.push_back(std::byte{0xf0});
    
         {
    -        const std::vector xor_pat{std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}, std::byte{0xff}};
    -        uint64_t xor_key;
    -        std::memcpy(&xor_key, xor_pat.data(), 8);
    -
             DataStream ds{in};
    -        ds.Xor(xor_key);
    +        ds.Xor(~0ULL);
             BOOST_CHECK_EQUAL("\xf0\x0f"s, ds.str());
         }
    
    @@ -307,9 +303,9 @@ BOOST_AUTO_TEST_CASE(streams_serializedata_xor)
         in.push_back(std::byte{0x0f});
    
         {
    -        const std::vector xor_pat{std::byte{0xff}, std::byte{0x0f}, std::byte{0xff}, std::byte{0x0f}, std::byte{0xff}, std::byte{0x0f}, std::byte{0xff}, std::byte{0x0f}};
             uint64_t xor_key;
    -        std::memcpy(&xor_key, xor_pat.data(), 8);
    +        constexpr std::array<uint8_t, sizeof xor_key> xor_pat{std::to_array<uint8_t>({0xff, 0x0f, 0xff, 0x0f, 0xff, 0x0f, 0xff, 0x0f})};
    +        std::memcpy(&xor_key, xor_pat.data(), sizeof xor_key);
    
             DataStream ds{in};
             ds.Xor(xor_key);
    
  90. fanquake commented at 10:07 am on November 12, 2024: member
    I haven’t really looked at the changes here, but just looking at the diff (+96'000 lines), my feedback would be that you’ll need to change your approach to making your data available (if the intent is to have it included), as I doubt we’ll be adding 96'000 lines to bench/xor.cpp. You could look at how we generate a header from bench/data/block413567.raw for a different approach; including a small binary blob and parsing it into a header at compile time is far more palatable.
  91. l0rinc commented at 3:16 pm on November 12, 2024: contributor

    Thanks @fanquake, I thought of that, can you please help me understand the constraints? Wouldn’t that require a cmake generation step from binary to header which would basically produce the exact same lines as what we have now? Would it help if I simply extracted it to a separate header file instead?

    So something like 1.28TB instead of 92GB, but I can’t seem to get my head screwed on right today.

    Yeah, I’ve edited that part since, my napkin calculations were only true for the first row, in other cases we would need to multiply by the first column - like you did. But the point is that it’s a lot of data that we have to scale down.

    but seems slightly slower in my measurements

    I thought of that as well, but wanted to avoid floating point conversion (likely the reason for the slowness in your example)

  92. fanquake commented at 5:51 pm on November 12, 2024: member

    Wouldn’t that require a cmake generation step from binary to header which would basically produce the exact same lines as what we have now?

    Yes. See bench/data/block413567.raw & bench/data/block413567.raw.h, where at build time a header file of ~125'000 lines is produced.

    Would it help if I simply extracted it to a separate header file instead?

    I don’t think so. The point is more to not add 100'000s of lines of “data” to this repo, which doesn’t scale across many benchmarks, creates unusable diffs, leaves (source) files unviewable on GH etc.

  93. test: Compare util::Xor with randomized inputs against simple impl
    Since production code only uses keys of length 8, we're not testing with other values anymore
    467add94ca
  94. bench: Make Xor benchmark more representative
    To make the benchmarks representative, I've collected the write vector's sizes during IBD for every invocation of `util::Xor` until 860k blocks, and used them as the basis for the micro-benchmarks, producing a similar distribution of random data.
    The histogram is in a separate compressed file (storing the differences compared to the previous row, achieving 15x compression), which is converted during build to a decompressed histogram (i.e. how many `util::Xor` calls were performed during IBD with vectors of the given size).
    
    And even though we already have serialization tests, `AutoFileXor` was added to serialize 1 MB via the provided key_bytes. This was used to test the effect of disabling obfuscation.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                1.86 |      536,276,762.89 |    0.3% |     10.25 | `AutoFileXor`
    |                1.04 |      961,221,881.72 |    0.2% |     93.30 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                1.83 |      547,257,936.65 |    0.2% |            9.20 |            3.47 |  2.655 |           1.03 |    0.1% |     11.04 | `AutoFileXor`
    |                0.97 |    1,032,916,679.87 |    0.0% |            9.01 |            3.29 |  2.738 |           1.00 |    0.0% |     86.58 | `XorHistogram`
    6a84bc0526
  95. optimization: Xor 64 bits together instead of byte-by-byte
    `util::Xor` method was split out into more focused parts:
    * one which assumes that the `uint64_t` key is properly aligned, doing the first few xors as 64 bits (the memcpy is eliminated in most compilers), and the last iteration is optimized for 8/16/32 bytes.
    * an unaligned `uint64_t` key with a `key_offset` parameter which is rotated to accommodate the data (adjusting for endianness).
    * a legacy `std::vector<std::byte>` key with an asserted 8 byte size, converted to `uint64_t`.
    
    Note that the default statement alone would pass the tests, but would be very slow, since the 1, 2 and 4 byte versions won't be specialized by the compiler, hence the switch.
    
    Asserts were added throughout the code to make sure every such vector has length 8, since in the next commit we're converting all of them to `uint64_t`.
    8e12f5dc6a
  96. refactor: Migrate fixed-size obfuscation end-to-end from `std::vector<std::byte>` to `uint64_t`
    Since `util::Xor` accepts `uint64_t` values, we're eliminating any repeated vector-to-uint64_t conversions going back to the loading/saving of these values (we're still serializing them as vectors, but converting as soon as possible to `uint64_t`). This is the reason the tests still generate vector values and convert to `uint64_t` later instead of generating it directly.
    
    We also short-circuit `Xor` calls with 0 key values early to avoid unnecessary calculations (e.g. `MakeWritableByteSpan`).
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                0.16 |    6,087,454,300.27 |    0.5% |     11.38 | `AutoFileXor`
    |                0.03 |   28,649,147,111.58 |    0.2% |      6.78 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                0.53 |    1,870,051,822.68 |    0.0% |            0.00 |            0.00 |  0.816 |           0.00 |    2.1% |     11.18 | `AutoFileXor`
    |                0.10 |   10,369,097,976.62 |    0.0% |            0.32 |            0.33 |  0.985 |           0.06 |    0.6% |      8.63 | `XorHistogram`
    
    ----
    
    A few other benchmarks that seem to have improved as well (tested with Clang only):
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,146,553.30 |              317.81 |    0.1% |     11.02 | `ReadBlockFromDiskTest`
    |        1,014,134.52 |              986.06 |    0.2% |     11.00 | `ReadRawBlockFromDiskTest`
    
    After:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,522,199.92 |              396.48 |    0.1% |     11.03 | `ReadBlockFromDiskTest`
    |           64,484.30 |           15,507.65 |    0.2% |     10.58 | `ReadRawBlockFromDiskTest`
    2ec822c723
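    The batching idea from the optimization commit can be sketched as follows — a simplified, hypothetical `XorSketch` stand-in, not the actual `util::Xor` implementation (which additionally specializes the 1/2/4-byte tails via a switch). Both paths XOR data byte `i` with key memory byte `i % 8`, so the result is independent of host endianness:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <span>
    #include <vector>

    // Simplified sketch of the batched XOR: the memcpy calls express unaligned
    // 64-bit loads/stores and are optimized out on the tested platforms.
    void XorSketch(std::span<std::byte> write, const uint64_t key)
    {
        std::byte key_bytes[sizeof key];
        std::memcpy(key_bytes, &key, sizeof key); // same in-memory byte order as the 64-bit path

        size_t i{0};
        for (; i + sizeof key <= write.size(); i += sizeof key) { // 8 bytes at a time
            uint64_t chunk;
            std::memcpy(&chunk, write.data() + i, sizeof key);
            chunk ^= key;
            std::memcpy(write.data() + i, &chunk, sizeof key);
        }
        for (; i < write.size(); ++i) { // remaining tail, byte-by-byte
            write[i] ^= key_bytes[i % sizeof key];
        }
    }

    int main()
    {
        std::vector<std::byte> data(13, std::byte{0xAB}); // deliberately not a multiple of 8
        const auto original{data};
        const uint64_t key{0x0123456789abcdefULL};

        XorSketch(data, key);
        assert(data != original); // obfuscated
        XorSketch(data, key);
        assert(data == original); // XOR is its own inverse
        return 0;
    }
    ```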
  97. l0rinc force-pushed on Nov 13, 2024
  98. l0rinc commented at 3:27 pm on November 13, 2024: contributor

    I understand you were burned by endianness but I disagree that it’s worth sacrificing readability where endianness is a non-issue.

    Thanks @hodlinator for the suggestions, I tried them all, but in the end decided that I value consistency more than coming up with a separate solution for each test case. These are ugly, I agree, but at least they’re testing the setup we’re using in prod. I did however change the hard-coded 8 values to sizeof xor_key (for memcpy) or sizeof(uint64_t) for vector inits.
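    The conversion kept in the tests boils down to something like this hypothetical `KeyFromVector` helper (illustrative only — the keys are still serialized as 8-byte vectors but converted to `uint64_t` as soon as they are loaded):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical sketch: convert a legacy 8-byte vector key to uint64_t.
    uint64_t KeyFromVector(const std::vector<std::byte>& key_bytes)
    {
        assert(key_bytes.size() == sizeof(uint64_t)); // production only uses length-8 keys
        uint64_t key;
        std::memcpy(&key, key_bytes.data(), sizeof key);
        return key;
    }

    int main()
    {
        // A zero key lets callers short-circuit the whole XOR pass ...
        assert(KeyFromVector(std::vector<std::byte>(8, std::byte{0})) == 0);
        // ... and an all-0xff vector maps to the all-ones key on any endianness.
        assert(KeyFromVector(std::vector<std::byte>(8, std::byte{0xff})) == ~0ULL);
        return 0;
    }
    ```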

    The point is more to not add 100'000s of lines of “data” to this repo

    I have stored the sorted diffs in the end (since the lines are correlated, i.e. more likely to have similar neighbours) and compressed them using .tar.gz (added the generator python script as a gist, please verify) - this way the histogram data is ~100 kb instead of 1.7 MB (thanks for the hint @fanquake). I’ve extended GenerateHeaderFromRaw.cmake with compression support (adjusting GenerateHeaders.cmake to trim the suffix from the header name) and added more safety asserts to make sure the data read back is the same as before.
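    The delta encoding of the sorted histogram rows can be sketched with a pair of hypothetical helpers (illustrative, not the actual generator script — the point is just that correlated neighbouring rows leave mostly small deltas, which compress far better):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Store each row as the difference to the previous one.
    std::vector<int64_t> DeltaEncode(const std::vector<int64_t>& rows)
    {
        std::vector<int64_t> deltas;
        int64_t prev{0};
        for (const int64_t v : rows) {
            deltas.push_back(v - prev);
            prev = v;
        }
        return deltas;
    }

    // Rebuild the original rows by running sums over the deltas.
    std::vector<int64_t> DeltaDecode(const std::vector<int64_t>& deltas)
    {
        std::vector<int64_t> rows;
        int64_t prev{0};
        for (const int64_t d : deltas) {
            prev += d;
            rows.push_back(prev);
        }
        return rows;
    }

    int main()
    {
        const std::vector<int64_t> sorted_rows{100, 101, 101, 103, 110};
        assert(DeltaEncode(sorted_rows) == (std::vector<int64_t>{100, 1, 0, 2, 7}));
        assert(DeltaDecode(DeltaEncode(sorted_rows)) == sorted_rows); // lossless roundtrip
        return 0;
    }
    ```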


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-23 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me