util: improve FindByte() performance #19690

pull LarryRuane wants to merge 2 commits into bitcoin:master from LarryRuane:FindByte-speedup changing 6 files +50 −8
  1. LarryRuane commented at 3:49 AM on August 10, 2020: contributor

    This PR is strictly a performance improvement; there is no functional change. The CBufferedFile::FindByte() method searches for the next occurrence of the given byte in the file. Currently, this is done by explicitly inspecting each byte in turn. This PR takes advantage of std::find() to do the same more efficiently, improving its CPU runtime by a factor of about 25 in typical use.

  2. LarryRuane commented at 3:50 AM on August 10, 2020: contributor

    This PR was suggested by @hebasto #16981 (comment) (thank you!)

  3. fanquake added the label Utils/log/libs on Aug 10, 2020
  4. in src/streams.h:842 in ab412ec307 outdated
     840 | -            if (vchBuf[nReadPos % vchBuf.size()] == ch)
     841 | -                break;
     842 | -            nReadPos++;
     843 | +            size_t n = vchBuf.size() - start;
     844 | +            if (n > nSrcPos - nReadPos)
     845 | +                n = nSrcPos - nReadPos;
    


    promag commented at 8:09 AM on August 10, 2020:

    nit, join with above line.

  5. promag commented at 8:15 AM on August 10, 2020: member

    The performance gain looks substantial - didn't verified.

  6. hebasto commented at 11:10 AM on August 10, 2020: member

    @LarryRuane

    ... improving its CPU runtime by a factor of about 25 in typical use.

    How was it measured?

  7. laanwj commented at 11:38 AM on August 10, 2020: member

    Any idea why this is so much faster? As far as I know, there is no faster algorithm to look for the first occurrence of a single byte in a memory array than a linear iteration over it. I'd expect std::find of a byte to simply unroll into a loop.

  8. LarryRuane commented at 2:38 PM on August 10, 2020: contributor

    How was it measured?

    I ran:

    time src/test/test_bitcoin --run_test=streams_tests/streams_findbyte
    

    with master (1m20s) and with master + PR (3s).

    Any idea why this is so much faster?

    I think you're correct that there's no faster way than a loop with runtime proportional to the number of bytes to scan, but I assume std::find() on a char vector is highly optimized, probably using memchr() or memcmp(), which are implemented in assembly language. Also, the master version does a few things each byte (testing nReadPos == nSrcPos, remainder calculation (%), incrementing nReadPos) that the PR does once for each large run of bytes.

    I just noticed that the repetition count on the test is set to a large number (50000000) and I meant to reduce it for the commit (3 seconds is too long to add to the unit test suite). I'll reduce that number in force push in a minute. This test doesn't really need to be in this PR (FindByte()'s functionality is tested very well in another test), but it helps reviewers verify the performance improvement.

  9. LarryRuane force-pushed on Aug 10, 2020
  10. LarryRuane commented at 2:46 PM on August 10, 2020: contributor

    Force-push a small fix to the test, so it doesn't take 3 seconds to run.

  11. laanwj commented at 4:43 PM on August 10, 2020: member

    but I assume std::find() on a char vector is highly optimized, probably using memchr() or memcmp(), which are implemented in assembly language

    That's true, it's possible to optimize that with assembly language (definitely with specific instruction sets).

    it still surprises me because you'd expect the I/O, to read the data from disk, to dominate greatly in the block importing. Not looking for a character already in memory! It just seems out of proportion.

  12. in src/test/streams_tests.cpp:463 in a31aa32554 outdated
     458 | +    }
     459 | +    b = 1;
     460 | +    fwrite(&b, 1, 1, file);
     461 | +    rewind(file);
     462 | +    CBufferedFile bf(file, fileSize * 2, fileSize, 0, 0);
     463 | +    for (int rep = 0; rep < 100; ++rep) {
    


    glozow commented at 7:19 PM on August 10, 2020:

    I don't fully understand how the performance increase is so significant, but why not a bench if you're worried about burdening the unit tests? I tried to do this but I must be doing something wrong because I can't seem to reproduce the speedup. 😞


    sipa commented at 8:12 PM on August 10, 2020:

    The performance impact may be very compiler/architecture/stdlib dependent. I'm kind of surprised std::find has optimizations beyond the naive loop implementation in the first place on some platforms, so I certainly wouldn't be surprised if others don't have it.


    LarryRuane commented at 9:39 PM on August 10, 2020:

    @gzhao408, thank you, I wasn't aware of bench. I suspect the iteration count, 100, is far too low and the difference is swamped by the noise. I increased the iteration count to 10m (1e7) and it showed the expected difference, master: 1,659.30 ns/op, PR: 52.66 ns/op (ratio is about 31).

    I just force-pushed (diff) to remove the unit test (which was only for benchmarking, not really testing anything) and cherry-pick Gloria's benchmark. I added one more commit to increase the iteration count.

  13. LarryRuane commented at 8:08 PM on August 10, 2020: contributor

    it still surprises me because you'd expect the I/O, to read the data from disk, to dominate greatly in the block importing.

    FindByte() only reads from disk by calling Fill() (when the buffer is empty), which is rare. In this test, Fill() gets called only once, the first time FindByte() runs, because I wanted to isolate the modified part of the code.

  14. LarryRuane force-pushed on Aug 10, 2020
  15. sipa commented at 12:42 AM on August 11, 2020: member

    @gmaxwell pointed out to me why this is so much faster: it's not that std::find is amazing, but that the original code (which I wrote in 2012, it seems!) is doing a modulus operation for every character (which is often orders of magnitude slower than the byte comparison or addition/subtraction).

    Thinking about this a bit more high level: the end goal is just to scan quickly for the 4-byte network magic in a file. If this is relevant for performance (and it seems it may be), it may be better to have a function that does exactly that in CBufferedFile, rather than a search for one byte + memcmp. std::search is probably what you want.

  16. LarryRuane commented at 2:10 PM on August 12, 2020: contributor

    Here's a version that's very close in implementation to master that eliminates the % operation on every character:

        void FindByte(char ch) {
            size_t start = nReadPos % vchBuf.size();
            while (true) {
                if (nReadPos == nSrcPos)
                    Fill();
                if (vchBuf[start] == ch)
                    break;
                nReadPos++;
                start++;
                if (start >= vchBuf.size()) start = 0;
            }
    

    and that does make a significant difference; bench_bitcoin reports 314 ns/op for this version. (My laptop isn't set up for very accurate CPU benchmarking, but my results are pretty consistent, varying by less than 1% across multiple runs.) That's about 5.3 times as fast as master (1,659.30 ns/op, as mentioned above). The PR (52.66 ns/op) is still about 6 times faster than this version. So this version is right about in the middle (in terms of ratios) between master and this PR.

    I do like @sipa's suggestion to generalize FindByte() to search for a given sequence of bytes (maybe called FindBytes() or Search(), suggestions welcome); that's a much nicer interface. I'll push a commit to do that within the next few hours.

  17. LarryRuane commented at 5:35 AM on August 13, 2020: contributor

    I just added a new commit (25413ab, can squash later) to implement @sipa's suggestion to add a method to CBufferedFile to find a sequence of bytes, rather than just one byte, as FindByte() does. This simplifies LoadExternalBlockFile(); the overall code base isn't simpler, but it encapsulates complexity nicely within the CBufferedFile class.

    It didn't work out to use std::search() because that function requires the bytes being searched to be consecutive in memory -- that works if the entire file is read into contiguous memory. But CBufferedFile implements a circular buffer. The bytes we're searching for might be partially in and partially out of memory (bytes are read on demand), and even if all are in memory, they may be split between the end of the buffer and the beginning (because of wraparound), so trying to use std::search gets complicated. FindByte() doesn't have these problems since it's looking for a single byte.

  18. adamjonas commented at 12:39 AM on August 14, 2020: member

    Looks like one of the sanitizers is finding an implicit-integer-sign-change problem in the latest update:

    �[0;34m node0 stderr streams.h:857:22: runtime error: implicit conversion from type 'unsigned char' of value 250 (8-bit, unsigned) to type 'char' changed the value to -6 (8-bit, signed)
        [#0](/bitcoin-bitcoin/0/) 0x55f6d4cc6285 in CBufferedFile::Search(unsigned char const*, unsigned long) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/./streams.h:857:22
        [#1](/bitcoin-bitcoin/1/) 0x55f6d4c967a0 in LoadExternalBlockFile(CChainParams const&, _IO_FILE*, FlatFilePos*) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/validation.cpp:4682:24
        [#2](/bitcoin-bitcoin/2/) 0x55f6d45071f6 in ThreadImport(ChainstateManager&, std::vector<boost::filesystem::path, std::allocator<boost::filesystem::path> >) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/init.cpp:705:13
        [#3](/bitcoin-bitcoin/3/) 0x55f6d4506814 in AppInitMain(util::Ref const&, NodeContext&)::$_10::operator()() const /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/init.cpp:1853:96
        [#4](/bitcoin-bitcoin/4/) 0x55f6d4506814 in std::_Function_handler<void (), AppInitMain(util::Ref const&, NodeContext&)::$_10>::_M_invoke(std::_Any_data const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:300:2
        [#5](/bitcoin-bitcoin/5/) 0x55f6d4585b29 in std::function<void ()>::operator()() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:688:14
        [#6](/bitcoin-bitcoin/6/) 0x55f6d45181bb in void TraceThread<std::function<void ()> >(char const*, std::function<void ()>) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/./util/system.h:438:9
        [#7](/bitcoin-bitcoin/7/) 0x55f6d4506447 in void std::__invoke_impl<void, void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10>(std::__invoke_other, void (*&&)(char const*, std::function<void ()>), char const*&&, AppInitMain(util::Ref const&, NodeContext&)::$_10&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60:14
        [#8](/bitcoin-bitcoin/8/) 0x55f6d450616a in std::__invoke_result<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10>::type std::__invoke<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10>(void (*&&)(char const*, std::function<void ()>), char const*&&, AppInitMain(util::Ref const&, NodeContext&)::$_10&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95:14
        [#9](/bitcoin-bitcoin/9/) 0x55f6d4505db2 in void std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:244:13
        [#10](/bitcoin-bitcoin/10/) 0x55f6d4505db2 in std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:251:11
        [#11](/bitcoin-bitcoin/11/) 0x55f6d4505db2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:195:13
        [#12](/bitcoin-bitcoin/12/) 0x7f20f2f18cb3  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd6cb3)
        [#13](/bitcoin-bitcoin/13/) 0x7f20f3317608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x9608)
        [#14](/bitcoin-bitcoin/14/) 0x7f20f2bf5102 in clone (/lib/x86_64-linux-gnu/libc.so.6+0x122102)
    
    SUMMARY: UndefinedBehaviorSanitizer: implicit-integer-sign-change streams.h:857:22
    
  19. LarryRuane force-pushed on Aug 14, 2020
  20. LarryRuane commented at 6:05 AM on August 14, 2020: contributor

    Thanks, @adamjonas, force-pushed a fix for that signed-unsigned CI failure, added another unit test, some improvements to Search(), some clang-format-diff.py cleanups.

  21. DrahtBot commented at 8:44 PM on August 20, 2020: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK stickies-v, achow101
    Stale ACK laanwj, sipa, hebasto, john-moffett

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    No conflicts as of last run.

  22. in src/bench/streams_findbyte.cpp:13 in 8f2e8c27bb outdated
       7 | +#include <test/util/setup_common.h>
       8 | +
       9 | +static void FindByte(benchmark::Bench& bench)
      10 | +{
      11 | +    // Setup
      12 | +    FILE* file = fsbridge::fopen("streams_tmp", "w+b");
    


    elichai commented at 10:09 AM on August 30, 2020:

    Could we maybe use something like memfd_create(2) for the benchmark, to decrease I/O noise? Anyone knows if there is a windows equivalent or a higher abstraction in boost::filesystem?


    elichai commented at 10:12 AM on August 30, 2020:

    fmemopen(3) also looks interesting


    elichai commented at 10:24 AM on August 30, 2020:

    LarryRuane commented at 3:28 PM on August 31, 2020:

    Thanks for the suggestions, but I don't think this matters, because the file IO (read) occurs only on the very first benchmark loop iteration; after that, the data is in memory and there is no IO at all. Each iteration repositions the stream pointer to zero (bf.SetPos(0)) and then searches forward for the value 1 (bf.FindByte(1)), which is 200 bytes away. But after the first iteration, all of these bytes are in memory, and we're just moving the position between 0 and 200.

    The reason for the 200, by the way, is that with random data, searching for a random byte value the average distance would be half of 256. But the data in blk.dat files (which is the only use of this code currently) isn't quite random; zero and 0xff occur more often (and we never search for those values). So I guessed that 200 is close to a typical distance that this function would move through before finding the requested byte. Maybe I should explain these points in a comment in bench/streams_findbyte.cpp.

  23. laanwj commented at 4:08 AM on November 20, 2020: member

    Code review ACK 8f2e8c27bb9bb3ebb9f094129ce4a6019c46cfc0, this looks like a better abstraction, and it's an improvement not to do a modulus for every byte.

  24. LarryRuane force-pushed on Dec 4, 2020
  25. LarryRuane commented at 9:33 PM on December 4, 2020: contributor

    Force-pushed to rationalize the commits, so I think it's in a good state for merging. I reordered the commit that adds the benchmark test, b576994bea620c7d065dd558c93222391151c9ff, to be first, so it's easy for reviewers to checkout that commit and run the new benchmark test without the improvements:

    src/bench/bench_bitcoin -filter=FindByte
    

    I re-ran those tests just now, and on my laptop, the ns/op without the PR is 1,840, and with the PR it's 55 (an improvement of more than a factor of 33).

  26. sipa commented at 10:33 PM on December 4, 2020: member

    Code review ACK 566c2e2eec1e84af5143f1554613a89192e29433. I think it'd be good to move the refactoring to FindByte in the last commit to a separate commit.

  27. LarryRuane force-pushed on Dec 5, 2020
  28. LarryRuane commented at 1:14 AM on December 5, 2020: contributor

    Force-pushed to implement latest review suggestion, changes are more cleanly separated among the commits (no code changes overall), thanks @sipa.

  29. in src/bench/streams_findbyte.cpp:33 in b576994bea outdated
      28 | +    // Cleanup
      29 | +    bf.fclose();
      30 | +    fs::remove("streams_tmp");
      31 | +}
      32 | +
      33 | +BENCHMARK(FindByte);
    


    hebasto commented at 2:28 PM on December 5, 2020:

    b576994bea620c7d065dd558c93222391151c9ff, nit: Add EOL Screenshot from 2020-12-05 16-27-26

  30. in src/test/streams_tests.cpp:342 in 134de90f2b outdated
     337 | +        bf.FindByte(99);
     338 | +        BOOST_CHECK(false);
     339 | +    } catch (const std::exception& e) {
     340 | +        BOOST_CHECK(strstr(e.what(),
     341 | +                           "CBufferedFile::Fill: end of file") != nullptr);
     342 | +    }
    


    hebasto commented at 2:30 PM on December 5, 2020:

    134de90f2b64ede6dbd812a9a88b00708685394d

        BOOST_CHECK_EXCEPTION(bf.FindByte(99), std::ios_base::failure, HasReason("CBufferedFile::Fill: end of file"));```
    
  31. in src/test/streams_tests.cpp:363 in 134de90f2b outdated
     358 | +            bf.Search(needle, 2);
     359 | +            BOOST_CHECK(false);
     360 | +        } catch (const std::exception& e) {
     361 | +            BOOST_CHECK(strstr(e.what(),
     362 | +                               "CBufferedFile::Fill: end of file") != nullptr);
     363 | +        }
    


    hebasto commented at 2:30 PM on December 5, 2020:

    134de90f2b64ede6dbd812a9a88b00708685394d

            BOOST_CHECK_EXCEPTION(bf.Search(needle, 2), std::ios_base::failure, HasReason("CBufferedFile::Fill: end of file"));
    
  32. hebasto changes_requested
  33. hebasto commented at 2:31 PM on December 5, 2020: member

    Concept ACK.

  34. in src/streams.h:614 in 134de90f2b outdated
     717 | @@ -718,7 +718,7 @@ class CBufferedFile
     718 |      uint64_t nReadPos;    //!< how many bytes have been read from this
     719 |      uint64_t nReadLimit;  //!< up to which position we're allowed to read
     720 |      uint64_t nRewind;     //!< how many bytes we guarantee to rewind
     721 | -    std::vector<char> vchBuf; //!< the buffer
     722 | +    std::vector<unsigned char> vchBuf; //!< the buffer
     723 |  
     724 |  protected:
    


    hebasto commented at 3:54 PM on December 5, 2020:

    While the CBufferedFile class is already touched, maybe make CBufferedFile::Fill private by removing this line?

  35. practicalswift commented at 4:55 PM on December 5, 2020: contributor

    @LarryRuane

    Thanks for providing promising micro benchmarks. AFAICT FindByte is only used when doing -reindex and -loadblock. Do you have any macro numbers too where the gains from switching to this improved version are measured in terms of the overall speed-up of said operations?

    Another thing: when fuzzing CBufferedFile I noticed that calling FindByte after a failed SetPos call may result in an infinite loop. Does the new code have that gotcha too? If not we can drop the setpos_fail logic in src/test/fuzz/buffered_file.cpp :)

  36. hebasto commented at 5:01 PM on December 5, 2020: member

    Another thing: when fuzzing CBufferedFile I noticed that calling FindByte after a failed SetPos call may result in an infinite loop. Does the new code have that gotcha too? If not we can drop the setpos_fail logic in src/test/fuzz/buffered_file.cpp :)

    Also some failed tests:

  37. LarryRuane marked this as a draft on Dec 6, 2020
  38. LarryRuane commented at 6:30 AM on December 7, 2020: contributor

    @practicalswift

    Do you have any macro numbers too where the gains from switching to this improved version are measured in terms of the overall speed-up of [-reindex and -loadblock]?

    I don't think there will be any measurable improvement in the sunny-day case, because the byte being searched for is the very first byte at the stream pointer. So no looping is needed (and this PR only improves the performance of that loop). But if there is corruption, damage in the blk.dat files, then FindByte() is being used to "resynchronize" with the stream, that is, to locate the start of the next valid block. I don't think that happens often in practice. But another advantage of this PR is that FindByte() may be used for something else in the future which does require actual looping in the sunny-day case.

    Another thing: when fuzzing CBufferedFile I noticed that calling FindByte after a failed SetPos call may result in an infinite loop. Does the new code have that gotcha too?

    TL;DR: No. We can remove that check.

    Details: I wanted to understand this better, so I checked out master immediately after your PR #19143 was merged, the commit where you added the SetPos() fuzzing (and the setpos_fail flag) (commit 614e0807a8137d82832aea45e4864b424f71f698), and did some investigation, and discovered that a failed SetPos() isn't what caused the infinite loop. The reason the setpos_fail flag seemed to fix the problem is that SetPos() typically fails very early in the fuzzing run, and that disables further FindByte() calls, after which the problem can't occur. (I didn't see anything that would prevent the infinite loop, but it's just unlikely.)

    The actual source of the problem is in the Seek() call. The implementation of that method calls the standard library fseek(), then, if that succeeds, ftell(), which is supposed to return the byte offset into the file of the FILE pointer. It sometimes returns -1 (which indicates error). That seems wrong, because if the fseek() is successful, then ftell() should return whatever offset we just seeked to. (I'm not sure why Seek() even called ftell(), since it already knows that offset.) This -1 (error) return is stored into both the CBufferedFile's read and write offsets (nReadPos and nSrcPos), which are unsigned, so things are messed up from here. (I was able to see why this caused an infinite loop, but it's not worth explaining here.)

    Anyway, this is all irrelevant now, because in #19593, @hebasto removed CBufferedFile::Seek because it wasn't being used. So, yes, we should remove this check (I did remove it and verified that there is no infinite loop), and I'll do that in this PR.

    Another thing I'd like to do is add fuzz testing for the new method CBufferedFile::Search() (it can be added to the existing test). That's why I changed this PR status to DRAFT. Also, I'll look into if those failed tests are real failures. Thanks for all your help!

  39. LarryRuane force-pushed on Dec 7, 2020
  40. LarryRuane commented at 5:51 PM on December 7, 2020: contributor

    Force-pushed the latest review suggestions:

    • remove unneeded protected: tag
    • remove unneeded setpos_fail from fuzz test
    • add fuzz test for new CBufferedFile::Search() method
    • simplify checking for expected exception in unit test (use BOOST_CHECK_EXCEPTION) @hebasto

    Also some failed tests

    I could not figure out those failures; I don't see how they have any relation to this PR.

  41. LarryRuane marked this as ready for review on Dec 7, 2020
  42. in src/bench/streams_findbyte.cpp:7 in 918de11daa outdated
       0 | @@ -0,0 +1,33 @@
       1 | +// Copyright (c) 2020 The Bitcoin Core developers
       2 | +// Distributed under the MIT software license, see the accompanying
       3 | +// file COPYING or http://www.opensource.org/licenses/mit-license.php.
       4 | +
       5 | +#include <bench/bench.h>
       6 | +#include <streams.h>
       7 | +#include <test/util/setup_common.h>
    


    hebasto commented at 2:31 PM on December 8, 2020:

    b576994bea620c7d065dd558c93222391151c9ff

    #include <bench/bench.h>
    
    #include <fs.h>
    #include <stdio.h>
    #include <streams.h>
    
  43. in src/bench/streams_findbyte.cpp:13 in 918de11daa outdated
       8 | +
       9 | +static void FindByte(benchmark::Bench& bench)
      10 | +{
      11 | +    // Setup
      12 | +    FILE* file = fsbridge::fopen("streams_tmp", "w+b");
      13 | +    const size_t fileSize = 200;
    


    hebasto commented at 2:33 PM on December 8, 2020:

    b576994bea620c7d065dd558c93222391151c9ff

        const size_t file_size = 200;
    
  44. in src/streams.h:848 in 918de11daa outdated
     848 | +            size_t inc = it_find - it_start;
     849 | +            nReadPos += inc;
     850 | +            if (inc < n) break;
     851 | +            i += inc;
     852 | +            if (i >= vchBuf.size()) i = 0;
     853 | +        }
    


    hebasto commented at 2:44 PM on December 8, 2020:

    865ed617512edb9e62dcce61aa48818c4ef3a295

    Some if statements could be eliminated :)

        void FindByte(unsigned char ch)
        {
            while (true) {
                if (nReadPos == nSrcPos) Fill();
                const size_t i = nReadPos % vchBuf.size();
                const size_t n = std::min<size_t>(vchBuf.size() - i, nSrcPos - nReadPos);
                const auto it_start = vchBuf.begin() + i;
                const auto it_find = std::find(it_start, it_start + n, ch);
                const size_t inc = it_find - it_start;
                nReadPos += inc;
                if (inc < n) break;
            }
        }
    

    It seems like names i and n could be more descriptive, no? But I have no a particular suggestion :)

  45. hebasto changes_requested
  46. hebasto commented at 5:13 PM on December 8, 2020: member

    Approach ACK 918de11daa4be9995122c5e4b0284285b27e0e8f

    While testing this branch on top of the master still having a fuzzing problem: load_external_block_file timeout. An infinite loop?

    If other reviewers are not opposed, I'd suggest to rebase this PR on top of the recent CI changes, in order to make this problem obvious to all reviewers.

    As bench/streams_findbyte.cpp is a new code, I believe it should follow our Developer Notes (header including and naming). If it'd be re-touched the adding an EOL could be moved from the third commit into the first one. Also the first commit message could contain a verb "Add" :)

    To get this PR closer to merging, maybe just drop changes that introduce the CBufferedFile::Search member function, as the performance issue has been resolved in the second commit?

  47. LarryRuane force-pushed on Dec 10, 2020
  48. LarryRuane commented at 2:33 AM on December 10, 2020: contributor

    @hebasto, thank you for your careful review! I force-pushed your suggestions except for moving nReadPos % vchBuf.size(); inside the loop. That does reduce the amount of code slightly, but the benchmark test showed that the time per operation consistently increased from 55ns to 67ns. That's still very good (much better than without this PR), but as long as we're changing this code, we may as well make it as fast as possible (within reason).

    I added you as a co-author, hope that's okay. I found better names for i and n, thanks.

    I took your suggestion to remove the new method, CBufferedFile::Search() and leave that for a later PR. That was a good suggestion because now this PR is small and more focused.

    It will be interesting to see if now the fuzzing passes. I don't understand how you determined (from the link you gave) that the problem is in load_external_block_file. When I run that locally (with no arguments), it succeeds normally. Is there a way I can run exactly the same test to debug the problem? I see it uses some fuzzing assets; could I download those?

  49. hebasto commented at 8:04 AM on December 10, 2020: member

    It will be interesting to see if now the fuzzing passes. I don't understand how you determined (from the link you gave) that the problem is in load_external_block_file.

    Job is timed out and the load_external_block_file is not listed in the log.

  50. hebasto approved
  51. hebasto commented at 10:43 AM on December 10, 2020: member

    ACK 2f772934fe4eb7a10667a89afc5884b3f61a6ab0

    Results of src/bench/bench_bitcoin -filter=FindByte on my machine:

    • master (38176dc66523881d45705fe8f96b9da76ec9981b) + this PR
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               64.12 |       15,596,723.38 |    0.4% |      0.00 | `FindByte
    
    • master (38176dc66523881d45705fe8f96b9da76ec9981b) + this PR w/o the top commit
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |            1,188.64 |          841,300.44 |    0.1% |      0.00 | `FindByte`
    

    CI tests in my personal account passed.

  52. sipa commented at 6:56 PM on June 6, 2021: member

    Rebase to get new CI?

  53. LarryRuane force-pushed on Jun 8, 2021
  54. LarryRuane commented at 8:56 PM on June 8, 2021: contributor

    Rebased.

  55. DrahtBot added the label Needs rebase on Jan 27, 2022
  56. LarryRuane force-pushed on Aug 1, 2022
  57. LarryRuane commented at 9:56 PM on August 1, 2022: contributor

    Force-pushed rebase, and also simplified the code significantly. Here's the previous (more complex) version of this PR, just for the record:

    <details> <summary>Click to expand patch</summary>

    --- a/src/streams.h
    +++ b/src/streams.h
    @@ -795,12 +795,17 @@ public:
     
         //! search for a given byte in the stream, and remain positioned on it
         void FindByte(char ch) {
    +        size_t buf_offset = nReadPos % vchBuf.size();
             while (true) {
    -            if (nReadPos == nSrcPos)
    -                Fill();
    -            if (vchBuf[nReadPos % vchBuf.size()] == ch)
    -                break;
    -            nReadPos++;
    +            if (nReadPos == nSrcPos) Fill();
    +            const size_t len = std::min<size_t>(vchBuf.size() - buf_offset, nSrcPos - nReadPos);
    +            const auto it_start = vchBuf.begin() + buf_offset;
    +            const auto it_find = std::find(it_start, it_start + len, ch);
    +            const size_t inc = it_find - it_start;
    +            nReadPos += inc;
    +            if (inc < len) break;
    +            buf_offset += inc;
    +            if (buf_offset >= vchBuf.size()) buf_offset = 0;
             }
         }
     };
    

    </details>

    The previous version is slightly faster than the current patch, but it's not worth the added complexity. I re-ran the benchmark (added by the first commit in this PR), and the nanoseconds per operation (on my middle-range laptop) is 1,650 on master and 315 with this PR.

  58. DrahtBot removed the label Needs rebase on Aug 1, 2022
  59. in src/bench/streams_findbyte.cpp:24 in 0238e5e548 outdated
      19 | +    b = 1;
      20 | +    fwrite(&b, 1, 1, file);
      21 | +    rewind(file);
      22 | +    CBufferedFile bf(file, file_size * 2, file_size, 0, 0);
      23 | +
      24 | +    bench.minEpochIterations(1e7).run([&] {
    


    stickies-v commented at 12:39 PM on August 17, 2022:

    I don't think we need this many iterations? Bringing this down from 1e7 to 1e5 reduces overall runtime from ~50 seconds to ~0.5 seconds, seemingly without any noticeable difference in ns/op (mean or variance) on two different machines.

  60. in src/bench/streams_findbyte.cpp:22 in 0238e5e548 outdated
      17 | +        fwrite(&b, 1, 1, file);
      18 | +    }
      19 | +    b = 1;
      20 | +    fwrite(&b, 1, 1, file);
      21 | +    rewind(file);
      22 | +    CBufferedFile bf(file, file_size * 2, file_size, 0, 0);
    


    stickies-v commented at 1:16 PM on August 17, 2022:

    It's super small either way, but I don't think the buffer needs to be twice the file size?

        CBufferedFile bf(file, file_size + 1, file_size, 0, 0);
    
  61. in src/bench/streams_findbyte.cpp:20 in 0238e5e548 outdated
      15 | +    uint8_t b = 0;
      16 | +    for (size_t i = 0; i < file_size; ++i) {
      17 | +        fwrite(&b, 1, 1, file);
      18 | +    }
      19 | +    b = 1;
      20 | +    fwrite(&b, 1, 1, file);
    


    stickies-v commented at 1:24 PM on August 17, 2022:

    Slightly more concise and readable imo:

        uint8_t data[file_size] = {0};
        data[file_size-1] = 1;
        fwrite(&data, sizeof(uint8_t), file_size, file);
    

    stickies-v commented at 1:40 PM on August 17, 2022:

    Less memory efficient, but at a file size of 200 bytes I think optimizing for readability is worthwhile?

  62. stickies-v commented at 1:34 PM on August 17, 2022: contributor

    Approach ACK 0238e5e548d221299ccf57bb8e9633507320997f

    On my older x86_64 MacOS laptop, I see very significant performance improvements, speeding up the benchmark from ~1900 ns/op to ~410 ns/op. On my newer M1 Pro MacOS laptop, the difference is much less pronounced: ~940 ns/op -> ~850 ns/op. Did plenty of iterations on both machines, results seem stable.

    Given the simplicity and limited scope of this PR, I think this is still a worthwhile improvement (at no extra maintenance cost) even if it only benefits certain platforms.

    Left a few comments, I think the bench # iterations needs to be reduced before merging, everything else is optional (but seems straightforward to implement too)

  63. achow101 commented at 5:49 PM on January 18, 2023: member

    Are you still working on this?

    There is a silent merge conflict with master, and also unresolved comments.

  64. LarryRuane commented at 6:04 PM on January 18, 2023: contributor

    Are you still working on this?

    Yes, I think this is still worth doing, I'll refresh today, thanks for the ping.

  65. LarryRuane force-pushed on Jan 18, 2023
  66. LarryRuane commented at 8:58 PM on January 18, 2023: contributor

    Force pushed to resolve a hidden conflict, also addressed all review comments.

  67. stickies-v commented at 2:41 PM on January 19, 2023: contributor

    Curiously, this PR is now a significant performance degradation (152ns/op -> 417 ns/op) on my M1 Pro. I repeated the benchmarks on the same commits as my last review, and am getting the same performance degradation there now, too. According to my logs I was on macOS 12.4 on both dates. Results are stable enough between runs.

    On my x86 (Debian Bullseye) machine, I'm still seeing significant performance improvement (1.717 ns/op -> 338 ns/op), as before.

    <details> <summary>bench results</summary>

    M1 Pro @ 5842d92c8 (unpatched)

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |              151.60 |        6,596,094.28 |    0.4% |      0.01 | `FindByte`
    

    M1 Pro @ 08cf6a9ed812 (patched)

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |              417.00 |        2,398,083.95 |    0.2% |      0.01 | `FindByte`
    

    x86 (Debian Bullseye) @ 5842d92c8 (unpatched)

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |            1,716.73 |          582,504.42 |    0.0% |      0.01 | `FindByte`
    

    x86 (Debian Bullseye) @ 08cf6a9ed812 (patched)

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |              337.99 |        2,958,650.44 |    1.5% |      0.01 | `FindByte`
    

    </details>

  68. LarryRuane commented at 7:57 PM on January 20, 2023: contributor

    Curiously, this PR is now a significant performance degradation

    As part of the last force-push I made this change:

    --- b/src/bench/streams_findbyte.cpp
    +++ a/src/bench/streams_findbyte.cpp
    @@ -18,7 +18,7 @@ static void FindByte(benchmark::Bench& bench)
         rewind(file);
         CBufferedFile bf(file, /*nBufSize=*/file_size + 1, /*nRewindIn=*/file_size, 0, 0);
     
    -    bench.minEpochIterations(1e7).run([&] {
    +    bench.run([&] {
             bf.SetPos(0);
             bf.FindByte(1);
         });
    

    Is it possible the test isn't running enough iterations to get an accurate measurement? Although, still, you wouldn't expect the patch to be worse!

    Can you try with minEpochIterations(1e5) (or even back to 1e7) and see if you get results consistent with before? I probably should restore the iterations count (and set it to 1e5 as you suggest).

    I get consistently 136 ns/op with the patch, and 300 ns/op without the patch, with either iterations set to 1e5 or 1e7. This is Debian Ubuntu jammy AMD x86 desktop.

  69. stickies-v commented at 12:11 AM on January 22, 2023: contributor

    Can you try with minEpochIterations(1e5) (or even back to 1e7) and see if you get results consistent with before?

    I'd tried that already, but the results are pretty much the same.

  70. fanquake commented at 10:39 AM on February 28, 2023: member

    What are the next steps here? Anyone want to look into why this is apparently a performance degredation for ARM macOS? Maybe @john-moffett?

  71. john-moffett commented at 6:47 PM on February 28, 2023: contributor

    I think it's mainly an architecture issue. I'm seeing similar performance degradation on my M1 to what @stickies-v reported. It's possible there could be pipeline issues with the conditional selection instructions that get generated by this line:

    if (++buf_offset >= vchBuf.size()) buf_offset = 0;

    The std::find version (eg - https://github.com/bitcoin/bitcoin/commit/2f772934fe4eb7a10667a89afc5884b3f61a6ab0) does not have the same issue, and I see significant performance gains in both x86 and ARM. Specifically, on my M1, master is 145 ns/op, 08cf6a9ed812c122e9d47cccee09e03d8139764c is 420 ns/op, and the std::find version is 81 ns/op.

    My suggestion would be to switch back to that version.

  72. LarryRuane force-pushed on Mar 2, 2023
  73. LarryRuane commented at 6:17 PM on March 2, 2023: contributor

    @john-moffett - Good idea to bring back the earlier commit (the condition branch instruction theory makes sense); I just restored (force-pushed) it as you suggested. On my x86 (ns/op, lower is better):

    master: 307 previous version of this PR (08cf6a9ed812c122e9d47cccee09e03d8139764c): 135 current version (using std::find): 34

  74. achow101 commented at 4:13 PM on March 15, 2023: member

    ACK dacd3316ca0279c56d12372957633b73a999b5b2

  75. DrahtBot requested review from hebasto on Mar 15, 2023
  76. DrahtBot requested review from laanwj on Mar 15, 2023
  77. DrahtBot requested review from sipa on Mar 15, 2023
  78. in src/streams.h:747 in dacd3316ca outdated
     743 | @@ -744,13 +744,22 @@ class CBufferedFile
     744 |      //! search for a given byte in the stream, and remain positioned on it
     745 |      void FindByte(uint8_t ch)
     746 |      {
     747 | +        const std::byte byte{static_cast<std::byte>(ch)};
    


    stickies-v commented at 6:47 PM on March 15, 2023:

    Given that there is only a single non-test callsite of FindByte(), would it make sense to just update the fn signature to take std::byte directly?

  79. in src/streams.h:748 in dacd3316ca outdated
     743 | @@ -744,13 +744,22 @@ class CBufferedFile
     744 |      //! search for a given byte in the stream, and remain positioned on it
     745 |      void FindByte(uint8_t ch)
     746 |      {
     747 | +        const std::byte byte{static_cast<std::byte>(ch)};
     748 | +        size_t buf_offset = m_read_pos % vchBuf.size();
    


    stickies-v commented at 6:47 PM on March 15, 2023:

    nit: and a bunch more of those

            size_t buf_offset{m_read_pos % vchBuf.size()};
    
  80. in src/streams.h:758 in dacd3316ca outdated
     743 | @@ -744,13 +744,22 @@ class CBufferedFile
     744 |      //! search for a given byte in the stream, and remain positioned on it
    


    stickies-v commented at 11:44 AM on March 16, 2023:

    I think some of the rationale of this implementation should be in the docs so future contributors don't simplify the code again to unintentionally undo the performance gains, e.g. why the modulo operator is kept outside of the while loop seems quite important and non-trivial?


    stickies-v commented at 12:14 PM on March 16, 2023:

    Unrelated to this PR, but I think it would also be helpful to improve the docs to specify that if ch is not found, std::ios_base::failure is thrown (from Fill()). It's an essential part of the interface imo.

    Perhaps worth improving on this behaviour, and have FindByte() throw its own error, by wrapping the Fill() command in a try/catch? Orthogonal to this PR, though. (And I also don't like that we're catching a general std::exception for a FindByte() failure, but again, orthogonal.)

  81. stickies-v commented at 12:21 PM on March 16, 2023: contributor

    Approach ACK dacd3316c

    I'm now seeing bench performance improvement on my M1 again, from ~150ns/op -> ~85ns/op.

  82. LarryRuane force-pushed on Mar 20, 2023
  83. LarryRuane commented at 4:08 AM on March 20, 2023: contributor

    Force-pushed for review comments (thanks, @stickies-v), verified benchmark performance is unchanged (ns/op with PR: 34.76, without PR: 302). Summary of force-push changes:

    • change FindByte() argument type from uint8_t to std::byte)
    • add "exception" comment before call to Fill()
    • add comment suggesting to avoid mod (%) operator within loop
    • changed assignment statements to use more modern braces syntax
  84. LarryRuane force-pushed on Mar 20, 2023
  85. LarryRuane force-pushed on Mar 20, 2023
  86. LarryRuane commented at 8:08 PM on March 20, 2023: contributor

    Force-pushed again to fix CI failures.

  87. in src/streams.h:1 in 0fe832c4a4


    stickies-v commented at 5:07 PM on March 21, 2023:

    Instead of changing the callsites of FindByte(), how about adding a uint8_t overload? I think it keeps the implementation clean, but since it can easily be argued that uint8_t also is a byte, this keeps the callsites straightforward and reduces the diff.

    <details> <summary>git diff</summary>

    diff --git a/src/bench/streams_findbyte.cpp b/src/bench/streams_findbyte.cpp
    index 175564fe9..7b2e65da2 100644
    --- a/src/bench/streams_findbyte.cpp
    +++ b/src/bench/streams_findbyte.cpp
    @@ -20,7 +20,7 @@ static void FindByte(benchmark::Bench& bench)
     
         bench.run([&] {
             bf.SetPos(0);
    -        bf.FindByte(std::byte(1));
    +        bf.FindByte(1);
         });
     
         // Cleanup
    diff --git a/src/streams.h b/src/streams.h
    index 2558bd830..9280fa013 100644
    --- a/src/streams.h
    +++ b/src/streams.h
    @@ -763,6 +763,8 @@ public:
                 if (buf_offset >= vchBuf.size()) buf_offset = 0;
             }
         }
    +
    +    void FindByte(uint8_t byte) { return FindByte(static_cast<std::byte>(byte)); }
     };
     
     #endif // BITCOIN_STREAMS_H
    diff --git a/src/test/fuzz/buffered_file.cpp b/src/test/fuzz/buffered_file.cpp
    index 2f7ce60c7..67cac8fa4 100644
    --- a/src/test/fuzz/buffered_file.cpp
    +++ b/src/test/fuzz/buffered_file.cpp
    @@ -53,7 +53,7 @@ FUZZ_TARGET(buffered_file)
                             return;
                         }
                         try {
    -                        opt_buffered_file->FindByte(std::byte(fuzzed_data_provider.ConsumeIntegral<uint8_t>()));
    +                        opt_buffered_file->FindByte(fuzzed_data_provider.ConsumeIntegral<uint8_t>());
                         } catch (const std::ios_base::failure&) {
                         }
                     },
    diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
    index 79bc7b7c0..1db5b61f1 100644
    --- a/src/test/streams_tests.cpp
    +++ b/src/test/streams_tests.cpp
    @@ -462,7 +462,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file_rand)
                     size_t find = currentPos + InsecureRandRange(8);
                     if (find >= fileSize)
                         find = fileSize - 1;
    -                bf.FindByte(std::byte(find));
    +                bf.FindByte(find);
                     // The value at each offset is the offset.
                     BOOST_CHECK_EQUAL(bf.GetPos(), find);
                     currentPos = find;
    diff --git a/src/validation.cpp b/src/validation.cpp
    index a79b81add..b42b39861 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -4438,7 +4438,7 @@ void Chainstate::LoadExternalBlockFile(
                 try {
                     // locate a header
                     unsigned char buf[CMessageHeader::MESSAGE_START_SIZE];
    -                blkdat.FindByte(std::byte(params.MessageStart()[0]));
    +                blkdat.FindByte(params.MessageStart()[0]);
                     nRewind = blkdat.GetPos() + 1;
                     blkdat >> buf;
                     if (memcmp(buf, params.MessageStart(), CMessageHeader::MESSAGE_START_SIZE)) {
    
    

    </details>

  88. in src/streams.h:767 in 0fe832c4a4 outdated
     750 |          while (true) {
     751 | -            if (m_read_pos == nSrcPos)
     752 | +            if (m_read_pos == nSrcPos) {
     753 | +                // No more bytes available; read from the file into the buffer,
     754 | +                // setting nSrcPos to one beyond the end of the new data.
     755 | +                // Throws exception if end-of-file reached.
    


    stickies-v commented at 5:08 PM on March 21, 2023:

    Given that it's part of the interface, I think this needs to be documented on the function level so devs wanting to use FindByte know how it behaves when the byte isn't found - they shouldn't need to dive into the implementation.

  89. in src/streams.h:761 in 0fe832c4a4 outdated
     741 | @@ -742,15 +742,25 @@ class CBufferedFile
     742 |      }
     743 |  
     744 |      //! search for a given byte in the stream, and remain positioned on it
     745 | -    void FindByte(uint8_t ch)
     746 | +    void FindByte(std::byte byte)
     747 |      {
     748 | +        // For best performance, avoid mod operation within the loop.
    


    stickies-v commented at 5:12 PM on March 21, 2023:

    nit

            // The modulus operation is much more expensive than byte
            // comparison and addition, so we keep it out of the loop to
            // improve performance (see [#19690](/bitcoin-bitcoin/19690/) for discussion).
    
  90. stickies-v approved
  91. stickies-v commented at 1:38 PM on March 22, 2023: contributor

    ACK 0fe832c4a

  92. DrahtBot requested review from achow101 on Mar 22, 2023
  93. achow101 commented at 6:35 PM on April 21, 2023: member

    ACK 0fe832c4a4b2049fdf967bca375468d5ac285563

  94. DrahtBot removed review request from achow101 on Apr 21, 2023
  95. john-moffett approved
  96. john-moffett commented at 7:04 PM on April 21, 2023: contributor

    ACK 0fe832c4a4b2049fdf967bca375468d5ac285563

  97. achow101 commented at 10:45 PM on April 21, 2023: member

    Silent merge conflict with master:

    ../../../src/bench/streams_findbyte.cpp:7:10: fatal error: fs.h: No such file or directory
        7 | #include <fs.h>
          |          ^~~~~~
    compilation terminated.
    
  98. [bench] add streams findbyte 604df63f6c
  99. util: improve streams.h:FindByte() performance
    Avoid use of the expensive mod operator (%) when calculating the
    buffer offset. No functional difference.
    
    Co-authored-by: Hennadii Stepanov <32963518+hebasto@users.noreply.github.com>
    72efc26439
  100. LarryRuane force-pushed on May 5, 2023
  101. LarryRuane commented at 12:04 PM on May 5, 2023: contributor

    Force pushed rebase to fix hidden merge conflict, thanks @achow101

  102. stickies-v approved
  103. stickies-v commented at 1:16 PM on May 5, 2023: contributor

    re-ACK 72efc26439da9a1344a19569fb0cab01f82ae7d1

    Verified that the only difference is to include <util/fs.h> instead of <fs.h> (introduced by https://github.com/bitcoin/bitcoin/pull/27254/commits/00e9b97f37e0bdf4c647236838c10b68b7ad5be3)

    % git range-diff HEAD~2 0fe832c4a4b2049fdf967bca375468d5ac285563 HEAD
    1:  5842d92c8 ! 1:  604df63f6 [bench] add streams findbyte
        @@ src/bench/streams_findbyte.cpp (new)
         +
         +#include <bench/bench.h>
         +
        -+#include <fs.h>
        ++#include <util/fs.h>
         +#include <streams.h>
         +
         +static void FindByte(benchmark::Bench& bench)
    2:  0fe832c4a = 2:  72efc2643 util: improve streams.h:FindByte() performance
    
  104. DrahtBot requested review from achow101 on May 5, 2023
  105. DrahtBot requested review from john-moffett on May 5, 2023
  106. achow101 commented at 9:40 PM on May 10, 2023: member

    re-ACK 72efc26439da9a1344a19569fb0cab01f82ae7d1

  107. DrahtBot removed review request from achow101 on May 10, 2023
  108. achow101 merged this on May 10, 2023
  109. achow101 closed this on May 10, 2023

  110. sidhujag referenced this in commit 7dc3f777e0 on May 11, 2023
  111. bitcoin locked this on May 9, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 15:14 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me