This PR is strictly a performance improvement; there is no functional change. The CBufferedFile::FindByte() method searches for the next occurrence of the given byte in the file. Currently, this is done by explicitly inspecting each byte in turn. This PR takes advantage of std::find() to do the same more efficiently, improving its CPU runtime by a factor of about 25 in typical use.
util: improve FindByte() performance #19690
pull LarryRuane wants to merge 2 commits into bitcoin:master from LarryRuane:FindByte-speedup changing 6 files +50 −8-
LarryRuane commented at 3:49 AM on August 10, 2020: contributor
-
LarryRuane commented at 3:50 AM on August 10, 2020: contributor
This PR was suggested by @hebasto #16981 (comment) (thank you!)
- fanquake added the label Utils/log/libs on Aug 10, 2020
-
in src/streams.h:842 in ab412ec307 outdated
840 | - if (vchBuf[nReadPos % vchBuf.size()] == ch) 841 | - break; 842 | - nReadPos++; 843 | + size_t n = vchBuf.size() - start; 844 | + if (n > nSrcPos - nReadPos) 845 | + n = nSrcPos - nReadPos;
promag commented at 8:09 AM on August 10, 2020:nit, join with above line.
promag commented at 8:15 AM on August 10, 2020: memberThe performance gain looks substantial - didn't verified.
hebasto commented at 11:10 AM on August 10, 2020: member... improving its CPU runtime by a factor of about 25 in typical use.
How was it measured?
laanwj commented at 11:38 AM on August 10, 2020: memberAny idea why this is so much faster? As far as I know, there is no faster algorithm to look for the first occurrence of a single byte in a memory array than a linear iteration over it. I'd expect
std::findof a byte to simply unroll into a loop.LarryRuane commented at 2:38 PM on August 10, 2020: contributorHow was it measured?
I ran:
time src/test/test_bitcoin --run_test=streams_tests/streams_findbytewith master (1m20s) and with master + PR (3s).
Any idea why this is so much faster?
I think you're correct that there's no faster way than a loop with runtime proportional to the number of bytes to scan, but I assume
std::find()on a char vector is highly optimized, probably usingmemchr()ormemcmp(), which are implemented in assembly language. Also, the master version does a few things each byte (testingnReadPos == nSrcPos, remainder calculation (%), incrementingnReadPos) that the PR does once for each large run of bytes.I just noticed that the repetition count on the test is set to a large number (50000000) and I meant to reduce it for the commit (3 seconds is too long to add to the unit test suite). I'll reduce that number in force push in a minute. This test doesn't really need to be in this PR (
FindByte()'s functionality is tested very well in another test), but it helps reviewers verify the performance improvement.LarryRuane force-pushed on Aug 10, 2020LarryRuane commented at 2:46 PM on August 10, 2020: contributorForce-push a small fix to the test, so it doesn't take 3 seconds to run.
laanwj commented at 4:43 PM on August 10, 2020: memberbut I assume std::find() on a char vector is highly optimized, probably using memchr() or memcmp(), which are implemented in assembly language
That's true, it's possible to optimize that with assembly language (definitely with specific instruction sets).
it still surprises me because you'd expect the I/O, to read the data from disk, to dominate greatly in the block importing. Not looking for a character already in memory! It just seems out of proportion.
in src/test/streams_tests.cpp:463 in a31aa32554 outdated
458 | + } 459 | + b = 1; 460 | + fwrite(&b, 1, 1, file); 461 | + rewind(file); 462 | + CBufferedFile bf(file, fileSize * 2, fileSize, 0, 0); 463 | + for (int rep = 0; rep < 100; ++rep) {
sipa commented at 8:12 PM on August 10, 2020:The performance impact may be very compiler/architecture/stdlib dependent. I'm kind of surprised std::find has optimizations beyond the naive loop implementation in the first place on some platforms, so I certainly wouldn't be surprised if others don't have it.
LarryRuane commented at 9:39 PM on August 10, 2020:@gzhao408, thank you, I wasn't aware of
bench. I suspect the iteration count, 100, is far too low and the difference is swamped by the noise. I increased the iteration count to 10m (1e7) and it showed the expected difference, master: 1,659.30 ns/op, PR: 52.66 ns/op (ratio is about 31).I just force-pushed (diff) to remove the unit test (which was only for benchmarking, not really testing anything) and cherry-pick Gloria's benchmark. I added one more commit to increase the iteration count.
LarryRuane commented at 8:08 PM on August 10, 2020: contributorit still surprises me because you'd expect the I/O, to read the data from disk, to dominate greatly in the block importing.
FindByte()only reads from disk by callingFill()(when the buffer is empty), which is rare. In this test,Fill()gets called only once, the first timeFindByte()runs, because I wanted to isolate the modified part of the code.LarryRuane force-pushed on Aug 10, 2020sipa commented at 12:42 AM on August 11, 2020: member@gmaxwell pointed out to me why this is so much faster: it's not that std::find is amazing, but that the original code (which I wrote in 2012, it seems!) is doing a modulus operation for every character (which is often orders of magnitude slower than the byte comparison or addition/subtraction).
Thinking about this a bit more high level: the end goal is just to scan quickly for the 4-byte network magic in a file. If this is relevant for performance (and it seems it may be), it may be better to have a function that does exactly that in CBufferedFile, rather than a search for one byte + memcmp.
std::searchis probably what you want.LarryRuane commented at 2:10 PM on August 12, 2020: contributorHere's a version that's very close in implementation to master that eliminates the
%operation on every character:void FindByte(char ch) { size_t start = nReadPos % vchBuf.size(); while (true) { if (nReadPos == nSrcPos) Fill(); if (vchBuf[start] == ch) break; nReadPos++; start++; if (start >= vchBuf.size()) start = 0; }and that does make a significant difference;
bench_bitcoinreports 314 ns/op for this version. (My laptop isn't set up for very accurate CPU benchmarking, but my results are pretty consistent, varying by less than 1% across multiple runs.) That's about 5.3 times as fast as master (1,659.30 ns/op, as mentioned above). The PR (52.66 ns/op) is still about 6 times faster than this version. So this version is right about in the middle (in terms of ratios) between master and this PR.I do like @sipa's suggestion to generalize
FindByte()to search for a given sequence of bytes (maybe calledFindBytes()orSearch(), suggestions welcome); that's a much nicer interface. I'll push a commit to do that within the next few hours.LarryRuane commented at 5:35 AM on August 13, 2020: contributorI just added a new commit (25413ab, can squash later) to implement @sipa's suggestion to add a method to
CBufferedFileto find a sequence of bytes, rather than just one byte, asFindByte()does. This simplifiesLoadExternalBlockFile(); the overall code base isn't simpler, but it encapsulates complexity nicely within theCBufferedFileclass.It didn't work out to use
std::search()because that function requires the bytes being searched to be consecutive in memory -- that works if the entire file is read into contiguous memory. ButCBufferedFileimplements a circular buffer. The bytes we're searching for might be partially in and partially out of memory (bytes are read on demand), and even if all are in memory, they may be split between the end of the buffer and the beginning (because of wraparound), so trying to usestd::searchgets complicated.FindByte()doesn't have these problems since it's looking for a single byte.adamjonas commented at 12:39 AM on August 14, 2020: memberLooks like one of the sanitizers is finding an implicit-integer-sign-change problem in the latest update:
�[0;34m node0 stderr streams.h:857:22: runtime error: implicit conversion from type 'unsigned char' of value 250 (8-bit, unsigned) to type 'char' changed the value to -6 (8-bit, signed) [#0](/bitcoin-bitcoin/0/) 0x55f6d4cc6285 in CBufferedFile::Search(unsigned char const*, unsigned long) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/./streams.h:857:22 [#1](/bitcoin-bitcoin/1/) 0x55f6d4c967a0 in LoadExternalBlockFile(CChainParams const&, _IO_FILE*, FlatFilePos*) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/validation.cpp:4682:24 [#2](/bitcoin-bitcoin/2/) 0x55f6d45071f6 in ThreadImport(ChainstateManager&, std::vector<boost::filesystem::path, std::allocator<boost::filesystem::path> >) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/init.cpp:705:13 [#3](/bitcoin-bitcoin/3/) 0x55f6d4506814 in AppInitMain(util::Ref const&, NodeContext&)::$_10::operator()() const /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/init.cpp:1853:96 [#4](/bitcoin-bitcoin/4/) 0x55f6d4506814 in std::_Function_handler<void (), AppInitMain(util::Ref const&, NodeContext&)::$_10>::_M_invoke(std::_Any_data const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:300:2 [#5](/bitcoin-bitcoin/5/) 0x55f6d4585b29 in std::function<void ()>::operator()() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:688:14 [#6](/bitcoin-bitcoin/6/) 0x55f6d45181bb in void TraceThread<std::function<void ()> >(char const*, std::function<void ()>) /tmp/cirrus-ci-build/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/src/./util/system.h:438:9 [#7](/bitcoin-bitcoin/7/) 0x55f6d4506447 in void std::__invoke_impl<void, void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10>(std::__invoke_other, void (*&&)(char const*, std::function<void ()>), char const*&&, AppInitMain(util::Ref const&, NodeContext&)::$_10&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60:14 [#8](/bitcoin-bitcoin/8/) 0x55f6d450616a in std::__invoke_result<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10>::type std::__invoke<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10>(void (*&&)(char const*, std::function<void ()>), char const*&&, AppInitMain(util::Ref const&, NodeContext&)::$_10&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95:14 [#9](/bitcoin-bitcoin/9/) 0x55f6d4505db2 in void std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:244:13 [#10](/bitcoin-bitcoin/10/) 0x55f6d4505db2 in std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:251:11 [#11](/bitcoin-bitcoin/11/) 0x55f6d4505db2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void ()>), char const*, AppInitMain(util::Ref const&, NodeContext&)::$_10> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:195:13 [#12](/bitcoin-bitcoin/12/) 0x7f20f2f18cb3 (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd6cb3) [#13](/bitcoin-bitcoin/13/) 0x7f20f3317608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x9608) [#14](/bitcoin-bitcoin/14/) 0x7f20f2bf5102 in clone (/lib/x86_64-linux-gnu/libc.so.6+0x122102) SUMMARY: UndefinedBehaviorSanitizer: implicit-integer-sign-change streams.h:857:22LarryRuane force-pushed on Aug 14, 2020LarryRuane commented at 6:05 AM on August 14, 2020: contributorThanks, @adamjonas, force-pushed a fix for that signed-unsigned CI failure, added another unit test, some improvements to
Search(), someclang-format-diff.pycleanups.DrahtBot commented at 8:44 PM on August 20, 2020: contributor<!--e57a25ab6845829454e8d69fc972939a-->
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
<!--021abf342d371248e50ceaed478a90ca-->
Reviews
See the guideline for information on the review process.
Type Reviewers ACK stickies-v, achow101 Stale ACK laanwj, sipa, hebasto, john-moffett If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
<!--174a7506f384e20aa4161008e828411d-->
Conflicts
No conflicts as of last run.
in src/bench/streams_findbyte.cpp:13 in 8f2e8c27bb outdated
7 | +#include <test/util/setup_common.h> 8 | + 9 | +static void FindByte(benchmark::Bench& bench) 10 | +{ 11 | + // Setup 12 | + FILE* file = fsbridge::fopen("streams_tmp", "w+b");
elichai commented at 10:09 AM on August 30, 2020:Could we maybe use something like
memfd_create(2)for the benchmark, to decrease I/O noise? Anyone knows if there is a windows equivalent or a higher abstraction in boost::filesystem?
elichai commented at 10:12 AM on August 30, 2020:fmemopen(3)also looks interesting
elichai commented at 10:24 AM on August 30, 2020:Maybe this can give us that? https://www.boost.org/doc/libs/1_72_0/libs/iostreams/doc/classes/mapped_file.html
LarryRuane commented at 3:28 PM on August 31, 2020:Thanks for the suggestions, but I don't think this matters, because the file IO (read) occurs only on the very first benchmark loop iteration; after that, the data is in memory and there is no IO at all. Each iteration repositions the stream pointer to zero (
bf.SetPos(0)) and then searches forward for the value 1 (bf.FindByte(1)), which is 200 bytes away. But after the first iteration, all of these bytes are in memory, and we're just moving the position between 0 and 200.The reason for the 200, by the way, is that with random data, searching for a random byte value the average distance would be half of 256. But the data in
blk.datfiles (which is the only use of this code currently) isn't quite random; zero and0xffoccur more often (and we never search for those values). So I guessed that 200 is close to a typical distance that this function would move through before finding the requested byte. Maybe I should explain these points in a comment inbench/streams_findbyte.cpp.laanwj commented at 4:08 AM on November 20, 2020: memberCode review ACK 8f2e8c27bb9bb3ebb9f094129ce4a6019c46cfc0, this looks like a better abstraction, and it's an improvement not to do a modulus for every byte.
LarryRuane force-pushed on Dec 4, 2020LarryRuane commented at 9:33 PM on December 4, 2020: contributorForce-pushed to rationalize the commits, so I think it's in a good state for merging. I reordered the commit that adds the benchmark test, b576994bea620c7d065dd558c93222391151c9ff, to be first, so it's easy for reviewers to checkout that commit and run the new benchmark test without the improvements:
src/bench/bench_bitcoin -filter=FindByteI re-ran those tests just now, and on my laptop, the ns/op without the PR is 1,840, and with the PR it's 55 (an improvement of more than a factor of 33).
sipa commented at 10:33 PM on December 4, 2020: memberCode review ACK 566c2e2eec1e84af5143f1554613a89192e29433. I think it'd be good to move the refactoring to FindByte in the last commit to a separate commit.
LarryRuane force-pushed on Dec 5, 2020LarryRuane commented at 1:14 AM on December 5, 2020: contributorForce-pushed to implement latest review suggestion, changes are more cleanly separated among the commits (no code changes overall), thanks @sipa.
in src/bench/streams_findbyte.cpp:33 in b576994bea outdated
28 | + // Cleanup 29 | + bf.fclose(); 30 | + fs::remove("streams_tmp"); 31 | +} 32 | + 33 | +BENCHMARK(FindByte);
hebasto commented at 2:28 PM on December 5, 2020:b576994bea620c7d065dd558c93222391151c9ff, nit: Add EOL

in src/test/streams_tests.cpp:342 in 134de90f2b outdated
337 | + bf.FindByte(99); 338 | + BOOST_CHECK(false); 339 | + } catch (const std::exception& e) { 340 | + BOOST_CHECK(strstr(e.what(), 341 | + "CBufferedFile::Fill: end of file") != nullptr); 342 | + }
hebasto commented at 2:30 PM on December 5, 2020:134de90f2b64ede6dbd812a9a88b00708685394d
BOOST_CHECK_EXCEPTION(bf.FindByte(99), std::ios_base::failure, HasReason("CBufferedFile::Fill: end of file"));```in src/test/streams_tests.cpp:363 in 134de90f2b outdated
358 | + bf.Search(needle, 2); 359 | + BOOST_CHECK(false); 360 | + } catch (const std::exception& e) { 361 | + BOOST_CHECK(strstr(e.what(), 362 | + "CBufferedFile::Fill: end of file") != nullptr); 363 | + }
hebasto commented at 2:30 PM on December 5, 2020:134de90f2b64ede6dbd812a9a88b00708685394d
BOOST_CHECK_EXCEPTION(bf.Search(needle, 2), std::ios_base::failure, HasReason("CBufferedFile::Fill: end of file"));hebasto changes_requestedhebasto commented at 2:31 PM on December 5, 2020: memberConcept ACK.
in src/streams.h:614 in 134de90f2b outdated
717 | @@ -718,7 +718,7 @@ class CBufferedFile 718 | uint64_t nReadPos; //!< how many bytes have been read from this 719 | uint64_t nReadLimit; //!< up to which position we're allowed to read 720 | uint64_t nRewind; //!< how many bytes we guarantee to rewind 721 | - std::vector<char> vchBuf; //!< the buffer 722 | + std::vector<unsigned char> vchBuf; //!< the buffer 723 | 724 | protected:
hebasto commented at 3:54 PM on December 5, 2020:While the
CBufferedFileclass is already touched, maybe makeCBufferedFile::Fillprivate by removing this line?practicalswift commented at 4:55 PM on December 5, 2020: contributorThanks for providing promising micro benchmarks. AFAICT
FindByteis only used when doing-reindexand-loadblock. Do you have any macro numbers too where the gains from switching to this improved version are measured in terms of the overall speed-up of said operations?Another thing: when fuzzing
CBufferedFileI noticed that callingFindByteafter a failedSetPoscall may result in an infinite loop. Does the new code have that gotcha too? If not we can drop thesetpos_faillogic insrc/test/fuzz/buffered_file.cpp:)hebasto commented at 5:01 PM on December 5, 2020: memberAnother thing: when fuzzing
CBufferedFileI noticed that callingFindByteafter a failedSetPoscall may result in an infinite loop. Does the new code have that gotcha too? If not we can drop thesetpos_faillogic insrc/test/fuzz/buffered_file.cpp:)Also some failed tests:
LarryRuane marked this as a draft on Dec 6, 2020LarryRuane commented at 6:30 AM on December 7, 2020: contributorDo you have any macro numbers too where the gains from switching to this improved version are measured in terms of the overall speed-up of [
-reindexand-loadblock]?I don't think there will be any measurable improvement in the sunny-day case, because the byte being searched for is the very first byte at the stream pointer. So no looping is needed (and this PR only improves the performance of that loop). But if there is corruption, damage in the
blk.datfiles, thenFindByte()is being used to "resynchronize" with the stream, that is, to locate the start of the next valid block. I don't think that happens often in practice. But another advantage of this PR is thatFindByte()may be used for something else in the future which does require actual looping in the sunny-day case.Another thing: when fuzzing
CBufferedFileI noticed that callingFindByteafter a failedSetPoscall may result in an infinite loop. Does the new code have that gotcha too?TL;DR: No. We can remove that check.
Details: I wanted to understand this better, so I checked out master immediately after your PR #19143 was merged, the commit where you added the
SetPos()fuzzing (and thesetpos_failflag) (commit 614e0807a8137d82832aea45e4864b424f71f698), and did some investigation, and discovered that a failedSetPos()isn't what caused the infinite loop. The reason thesetpos_failflag seemed to fix the problem is thatSetPos()typically fails very early in the fuzzing run, and that disables furtherFindByte()calls, after which the problem can't occur. (I didn't see anything that would prevent the infinite loop, but it's just unlikely.)The actual source of the problem is in the
Seek()call. The implementation of that method calls the standard libraryfseek(), then, if that succeeds,ftell(), which is supposed to return the byte offset into the file of theFILEpointer. It sometimes returns-1(which indicates error). That seems wrong, because if thefseek()is successful, thenftell()should return whatever offset we just seeked to. (I'm not sure whySeek()even calledftell(), since it already knows that offset.) This-1(error) return is stored into both theCBufferedFile's read and write offsets (nReadPosandnSrcPos), which are unsigned, so things are messed up from here. (I was able to see why this caused an infinite loop, but it's not worth explaining here.)Anyway, this is all irrelevant now, because in #19593, @hebasto removed
CBufferedFile::Seekbecause it wasn't being used. So, yes, we should remove this check (I did remove it and verified that there is no infinite loop), and I'll do that in this PR.Another thing I'd like to do is add fuzz testing for the new method
CBufferedFile::Search()(it can be added to the existing test). That's why I changed this PR status to DRAFT. Also, I'll look into if those failed tests are real failures. Thanks for all your help!LarryRuane force-pushed on Dec 7, 2020LarryRuane commented at 5:51 PM on December 7, 2020: contributorForce-pushed the latest review suggestions:
- remove unneeded
protected:tag - remove unneeded
setpos_failfrom fuzz test - add fuzz test for new
CBufferedFile::Search()method - simplify checking for expected exception in unit test (use
BOOST_CHECK_EXCEPTION) @hebasto
Also some failed tests
I could not figure out those failures; I don't see how they have any relation to this PR.
LarryRuane marked this as ready for review on Dec 7, 2020in src/bench/streams_findbyte.cpp:7 in 918de11daa outdated
0 | @@ -0,0 +1,33 @@ 1 | +// Copyright (c) 2020 The Bitcoin Core developers 2 | +// Distributed under the MIT software license, see the accompanying 3 | +// file COPYING or http://www.opensource.org/licenses/mit-license.php. 4 | + 5 | +#include <bench/bench.h> 6 | +#include <streams.h> 7 | +#include <test/util/setup_common.h>
hebasto commented at 2:31 PM on December 8, 2020:b576994bea620c7d065dd558c93222391151c9ff
#include <bench/bench.h> #include <fs.h> #include <stdio.h> #include <streams.h>in src/bench/streams_findbyte.cpp:13 in 918de11daa outdated
8 | + 9 | +static void FindByte(benchmark::Bench& bench) 10 | +{ 11 | + // Setup 12 | + FILE* file = fsbridge::fopen("streams_tmp", "w+b"); 13 | + const size_t fileSize = 200;
hebasto commented at 2:33 PM on December 8, 2020:b576994bea620c7d065dd558c93222391151c9ff
const size_t file_size = 200;in src/streams.h:848 in 918de11daa outdated
848 | + size_t inc = it_find - it_start; 849 | + nReadPos += inc; 850 | + if (inc < n) break; 851 | + i += inc; 852 | + if (i >= vchBuf.size()) i = 0; 853 | + }
hebasto commented at 2:44 PM on December 8, 2020:865ed617512edb9e62dcce61aa48818c4ef3a295
Some
ifstatements could be eliminated :)void FindByte(unsigned char ch) { while (true) { if (nReadPos == nSrcPos) Fill(); const size_t i = nReadPos % vchBuf.size(); const size_t n = std::min<size_t>(vchBuf.size() - i, nSrcPos - nReadPos); const auto it_start = vchBuf.begin() + i; const auto it_find = std::find(it_start, it_start + n, ch); const size_t inc = it_find - it_start; nReadPos += inc; if (inc < n) break; } }It seems like names
iandncould be more descriptive, no? But I have no a particular suggestion :)hebasto changes_requestedhebasto commented at 5:13 PM on December 8, 2020: memberApproach ACK 918de11daa4be9995122c5e4b0284285b27e0e8f
While testing this branch on top of the master still having a fuzzing problem:
load_external_block_filetimeout. An infinite loop?If other reviewers are not opposed, I'd suggest to rebase this PR on top of the recent CI changes, in order to make this problem obvious to all reviewers.
As
bench/streams_findbyte.cppis a new code, I believe it should follow our Developer Notes (header including and naming). If it'd be re-touched the adding an EOL could be moved from the third commit into the first one. Also the first commit message could contain a verb "Add" :)To get this PR closer to merging, maybe just drop changes that introduce the
CBufferedFile::Searchmember function, as the performance issue has been resolved in the second commit?LarryRuane force-pushed on Dec 10, 2020LarryRuane commented at 2:33 AM on December 10, 2020: contributor@hebasto, thank you for your careful review! I force-pushed your suggestions except for moving
nReadPos % vchBuf.size();inside the loop. That does reduce the amount of code slightly, but the benchmark test showed that the time per operation consistently increased from 55ns to 67ns. That's still very good (much better than without this PR), but as long as we're changing this code, we may as well make it as fast as possible (within reason).I added you as a co-author, hope that's okay. I found better names for
iandn, thanks.I took your suggestion to remove the new method,
CBufferedFile::Search()and leave that for a later PR. That was a good suggestion because now this PR is small and more focused.It will be interesting to see if now the fuzzing passes. I don't understand how you determined (from the link you gave) that the problem is in
load_external_block_file. When I run that locally (with no arguments), it succeeds normally. Is there a way I can run exactly the same test to debug the problem? I see it uses some fuzzing assets; could I download those?hebasto commented at 8:04 AM on December 10, 2020: memberIt will be interesting to see if now the fuzzing passes. I don't understand how you determined (from the link you gave) that the problem is in
load_external_block_file.Job is timed out and the
load_external_block_fileis not listed in the log.hebasto approvedhebasto commented at 10:43 AM on December 10, 2020: memberACK 2f772934fe4eb7a10667a89afc5884b3f61a6ab0
Results of
src/bench/bench_bitcoin -filter=FindByteon my machine:- master (38176dc66523881d45705fe8f96b9da76ec9981b) + this PR
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 64.12 | 15,596,723.38 | 0.4% | 0.00 | `FindByte- master (38176dc66523881d45705fe8f96b9da76ec9981b) + this PR w/o the top commit
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 1,188.64 | 841,300.44 | 0.1% | 0.00 | `FindByte`CI tests in my personal account passed.
sipa commented at 6:56 PM on June 6, 2021: memberRebase to get new CI?
LarryRuane force-pushed on Jun 8, 2021LarryRuane commented at 8:56 PM on June 8, 2021: contributorRebased.
DrahtBot added the label Needs rebase on Jan 27, 2022LarryRuane force-pushed on Aug 1, 2022LarryRuane commented at 9:56 PM on August 1, 2022: contributorForce-pushed rebase, and also simplified the code significantly. Here's the previous (more complex) version of this PR, just for the record:
<details> <summary>Click to expand patch</summary>
--- a/src/streams.h +++ b/src/streams.h @@ -795,12 +795,17 @@ public: //! search for a given byte in the stream, and remain positioned on it void FindByte(char ch) { + size_t buf_offset = nReadPos % vchBuf.size(); while (true) { - if (nReadPos == nSrcPos) - Fill(); - if (vchBuf[nReadPos % vchBuf.size()] == ch) - break; - nReadPos++; + if (nReadPos == nSrcPos) Fill(); + const size_t len = std::min<size_t>(vchBuf.size() - buf_offset, nSrcPos - nReadPos); + const auto it_start = vchBuf.begin() + buf_offset; + const auto it_find = std::find(it_start, it_start + len, ch); + const size_t inc = it_find - it_start; + nReadPos += inc; + if (inc < len) break; + buf_offset += inc; + if (buf_offset >= vchBuf.size()) buf_offset = 0; } } };</details>
The previous version is slightly faster than the current patch, but it's not worth the added complexity. I re-ran the benchmark (added by the first commit in this PR), and the nanoseconds per operation (on my middle-range laptop) is 1,650 on
masterand 315 with this PR.DrahtBot removed the label Needs rebase on Aug 1, 2022in src/bench/streams_findbyte.cpp:24 in 0238e5e548 outdated
19 | + b = 1; 20 | + fwrite(&b, 1, 1, file); 21 | + rewind(file); 22 | + CBufferedFile bf(file, file_size * 2, file_size, 0, 0); 23 | + 24 | + bench.minEpochIterations(1e7).run([&] {
stickies-v commented at 12:39 PM on August 17, 2022:I don't think we need this many iterations? Bringing this down from
1e7to1e5reduces overall runtime from ~50 seconds to ~0.5 seconds, seemingly without any noticeable difference in ns/op (mean or variance) on two different machines.in src/bench/streams_findbyte.cpp:22 in 0238e5e548 outdated
17 | + fwrite(&b, 1, 1, file); 18 | + } 19 | + b = 1; 20 | + fwrite(&b, 1, 1, file); 21 | + rewind(file); 22 | + CBufferedFile bf(file, file_size * 2, file_size, 0, 0);
stickies-v commented at 1:16 PM on August 17, 2022:It's super small either way, but I don't think the buffer needs to be twice the file size?
CBufferedFile bf(file, file_size + 1, file_size, 0, 0);in src/bench/streams_findbyte.cpp:20 in 0238e5e548 outdated
15 | + uint8_t b = 0; 16 | + for (size_t i = 0; i < file_size; ++i) { 17 | + fwrite(&b, 1, 1, file); 18 | + } 19 | + b = 1; 20 | + fwrite(&b, 1, 1, file);
stickies-v commented at 1:24 PM on August 17, 2022:Slightly more concise and readable imo:
uint8_t data[file_size] = {0}; data[file_size-1] = 1; fwrite(&data, sizeof(uint8_t), file_size, file);
stickies-v commented at 1:40 PM on August 17, 2022:Less memory efficient, but at a file size of 200 bytes I think optimizing for readability is worthwhile?
stickies-v commented at 1:34 PM on August 17, 2022: contributorApproach ACK 0238e5e548d221299ccf57bb8e9633507320997f
On my older x86_64 MacOS laptop, I see very significant performance improvements, speeding up the benchmark from ~1900 ns/op to ~410 ns/op. On my newer M1 Pro MacOS laptop, the difference is much less pronounced: ~940 ns/op -> ~850 ns/op. Did plenty of iterations on both machines, results seem stable.
Given the simplicity and limited scope of this PR, I think this is still a worthwhile improvement (at no extra maintenance cost) even if it only benefits certain platforms.
Left a few comments, I think the bench # iterations needs to be reduced before merging, everything else is optional (but seems straightforward to implement too)
achow101 commented at 5:49 PM on January 18, 2023: memberAre you still working on this?
There is a silent merge conflict with master, and also unresolved comments.
LarryRuane commented at 6:04 PM on January 18, 2023: contributorAre you still working on this?
Yes, I think this is still worth doing, I'll refresh today, thanks for the ping.
LarryRuane force-pushed on Jan 18, 2023LarryRuane commented at 8:58 PM on January 18, 2023: contributorForce pushed to resolve a hidden conflict, also addressed all review comments.
stickies-v commented at 2:41 PM on January 19, 2023: contributorCuriously, this PR is now a significant performance degradation (152ns/op -> 417 ns/op) on my M1 Pro. I repeated the benchmarks on the same commits as my last review, and am getting the same performance degradation there now, too. According to my logs I was on macOS 12.4 on both dates. Results are stable enough between runs.
On my x86 (Debian Bullseye) machine, I'm still seeing significant performance improvement (1.717 ns/op -> 338 ns/op), as before.
<details> <summary>bench results</summary>
M1 Pro @ 5842d92c8 (unpatched)
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 151.60 | 6,596,094.28 | 0.4% | 0.01 | `FindByte`M1 Pro @ 08cf6a9ed812 (patched)
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 417.00 | 2,398,083.95 | 0.2% | 0.01 | `FindByte`x86 (Debian Bullseye) @ 5842d92c8 (unpatched)
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 1,716.73 | 582,504.42 | 0.0% | 0.01 | `FindByte`x86 (Debian Bullseye) @ 08cf6a9ed812 (patched)
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 337.99 | 2,958,650.44 | 1.5% | 0.01 | `FindByte`</details>
LarryRuane commented at 7:57 PM on January 20, 2023: contributorCuriously, this PR is now a significant performance degradation
As part of the last force-push I made this change:
--- b/src/bench/streams_findbyte.cpp +++ a/src/bench/streams_findbyte.cpp @@ -18,7 +18,7 @@ static void FindByte(benchmark::Bench& bench) rewind(file); CBufferedFile bf(file, /*nBufSize=*/file_size + 1, /*nRewindIn=*/file_size, 0, 0); - bench.minEpochIterations(1e7).run([&] { + bench.run([&] { bf.SetPos(0); bf.FindByte(1); });Is it possible the test isn't running enough iterations to get an accurate measurement? Although, still, you wouldn't expect the patch to be worse!
Can you try with
minEpochIterations(1e5)(or even back to1e7) and see if you get results consistent with before? I probably should restore the iterations count (and set it to 1e5 as you suggest).I get consistently 136 ns/op with the patch, and 300 ns/op without the patch, with either iterations set to
1e5or1e7. This is Debian Ubuntu jammy AMD x86 desktop.stickies-v commented at 12:11 AM on January 22, 2023: contributorCan you try with
minEpochIterations(1e5)(or even back to1e7) and see if you get results consistent with before?I'd tried that already, but the results are pretty much the same.
fanquake commented at 10:39 AM on February 28, 2023: memberWhat are the next steps here? Anyone want to look into why this is apparently a performance degredation for ARM macOS? Maybe @john-moffett?
john-moffett commented at 6:47 PM on February 28, 2023: contributorI think it's mainly an architecture issue. I'm seeing similar performance degradation on my M1 to what @stickies-v reported. It's possible there could be pipeline issues with the conditional selection instructions that get generated by this line:
if (++buf_offset >= vchBuf.size()) buf_offset = 0;The
std::findversion (eg - https://github.com/bitcoin/bitcoin/commit/2f772934fe4eb7a10667a89afc5884b3f61a6ab0) does not have the same issue, and I see significant performance gains in both x86 and ARM. Specifically, on my M1, master is 145 ns/op, 08cf6a9ed812c122e9d47cccee09e03d8139764c is 420 ns/op, and thestd::findversion is 81 ns/op.My suggestion would be to switch back to that version.
LarryRuane force-pushed on Mar 2, 2023LarryRuane commented at 6:17 PM on March 2, 2023: contributor@john-moffett - Good idea to bring back the earlier commit (the condition branch instruction theory makes sense); I just restored (force-pushed) it as you suggested. On my x86 (ns/op, lower is better):
master: 307 previous version of this PR (08cf6a9ed812c122e9d47cccee09e03d8139764c): 135 current version (using std::find): 34
achow101 commented at 4:13 PM on March 15, 2023: memberACK dacd3316ca0279c56d12372957633b73a999b5b2
DrahtBot requested review from hebasto on Mar 15, 2023DrahtBot requested review from laanwj on Mar 15, 2023DrahtBot requested review from sipa on Mar 15, 2023in src/streams.h:747 in dacd3316ca outdated
743 | @@ -744,13 +744,22 @@ class CBufferedFile 744 | //! search for a given byte in the stream, and remain positioned on it 745 | void FindByte(uint8_t ch) 746 | { 747 | + const std::byte byte{static_cast<std::byte>(ch)};
stickies-v commented at 6:47 PM on March 15, 2023:Given that there is only a single non-test callsite of
FindByte(), would it make sense to just update the fn signature to takestd::bytedirectly?in src/streams.h:748 in dacd3316ca outdated
743 | @@ -744,13 +744,22 @@ class CBufferedFile 744 | //! search for a given byte in the stream, and remain positioned on it 745 | void FindByte(uint8_t ch) 746 | { 747 | + const std::byte byte{static_cast<std::byte>(ch)}; 748 | + size_t buf_offset = m_read_pos % vchBuf.size();
stickies-v commented at 6:47 PM on March 15, 2023:nit: and a bunch more of those
size_t buf_offset{m_read_pos % vchBuf.size()};in src/streams.h:758 in dacd3316ca outdated
743 | @@ -744,13 +744,22 @@ class CBufferedFile 744 | //! search for a given byte in the stream, and remain positioned on it
stickies-v commented at 11:44 AM on March 16, 2023:I think some of the rationale of this implementation should be in the docs so future contributors don't simplify the code again to unintentionally undo the performance gains, e.g. why the modulo operator is kept outside of the while loop seems quite important and non-trivial?
stickies-v commented at 12:14 PM on March 16, 2023:Unrelated to this PR, but I think it would also be helpful to improve the docs to specify that if
chis not found,std::ios_base::failureis thrown (fromFill()). It's an essential part of the interface imo.Perhaps worth improving on this behaviour, and have
FindByte()throw its own error, by wrapping theFill()command in a try/catch? Orthogonal to this PR, though. (And I also don't like that we're catching a generalstd::exceptionfor aFindByte()failure, but again, orthogonal.)stickies-v commented at 12:21 PM on March 16, 2023: contributorApproach ACK dacd3316c
I'm now seeing bench performance improvement on my M1 again, from ~150ns/op -> ~85ns/op.
LarryRuane force-pushed on Mar 20, 2023LarryRuane commented at 4:08 AM on March 20, 2023: contributorForce-pushed for review comments (thanks, @stickies-v), verified benchmark performance is unchanged (ns/op with PR: 34.76, without PR: 302). Summary of force-push changes:
- change
FindByte()argument type fromuint8_ttostd::byte) - add "exception" comment before call to
Fill() - add comment suggesting to avoid mod (%) operator within loop
- changed assignment statements to use more modern braces syntax
LarryRuane force-pushed on Mar 20, 2023LarryRuane force-pushed on Mar 20, 2023LarryRuane commented at 8:08 PM on March 20, 2023: contributorForce-pushed again to fix CI failures.
in src/streams.h:1 in 0fe832c4a4
stickies-v commented at 5:07 PM on March 21, 2023:Instead of changing the callsites of
FindByte(), how about adding auint8_toverload? I think it keeps the implementation clean, but since it can easily be argued thatuint8_talso is a byte, this keeps the callsites straightforward and reduces the diff.<details> <summary>git diff</summary>
diff --git a/src/bench/streams_findbyte.cpp b/src/bench/streams_findbyte.cpp index 175564fe9..7b2e65da2 100644 --- a/src/bench/streams_findbyte.cpp +++ b/src/bench/streams_findbyte.cpp @@ -20,7 +20,7 @@ static void FindByte(benchmark::Bench& bench) bench.run([&] { bf.SetPos(0); - bf.FindByte(std::byte(1)); + bf.FindByte(1); }); // Cleanup diff --git a/src/streams.h b/src/streams.h index 2558bd830..9280fa013 100644 --- a/src/streams.h +++ b/src/streams.h @@ -763,6 +763,8 @@ public: if (buf_offset >= vchBuf.size()) buf_offset = 0; } } + + void FindByte(uint8_t byte) { return FindByte(static_cast<std::byte>(byte)); } }; #endif // BITCOIN_STREAMS_H diff --git a/src/test/fuzz/buffered_file.cpp b/src/test/fuzz/buffered_file.cpp index 2f7ce60c7..67cac8fa4 100644 --- a/src/test/fuzz/buffered_file.cpp +++ b/src/test/fuzz/buffered_file.cpp @@ -53,7 +53,7 @@ FUZZ_TARGET(buffered_file) return; } try { - opt_buffered_file->FindByte(std::byte(fuzzed_data_provider.ConsumeIntegral<uint8_t>())); + opt_buffered_file->FindByte(fuzzed_data_provider.ConsumeIntegral<uint8_t>()); } catch (const std::ios_base::failure&) { } }, diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp index 79bc7b7c0..1db5b61f1 100644 --- a/src/test/streams_tests.cpp +++ b/src/test/streams_tests.cpp @@ -462,7 +462,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file_rand) size_t find = currentPos + InsecureRandRange(8); if (find >= fileSize) find = fileSize - 1; - bf.FindByte(std::byte(find)); + bf.FindByte(find); // The value at each offset is the offset. BOOST_CHECK_EQUAL(bf.GetPos(), find); currentPos = find; diff --git a/src/validation.cpp b/src/validation.cpp index a79b81add..b42b39861 100644 --- a/src/validation.cpp +++ b/src/validation.cpp @@ -4438,7 +4438,7 @@ void Chainstate::LoadExternalBlockFile( try { // locate a header unsigned char buf[CMessageHeader::MESSAGE_START_SIZE]; - blkdat.FindByte(std::byte(params.MessageStart()[0])); + blkdat.FindByte(params.MessageStart()[0]); nRewind = blkdat.GetPos() + 1; blkdat >> buf; if (memcmp(buf, params.MessageStart(), CMessageHeader::MESSAGE_START_SIZE)) {</details>
in src/streams.h:767 in 0fe832c4a4 outdated
750 | while (true) { 751 | - if (m_read_pos == nSrcPos) 752 | + if (m_read_pos == nSrcPos) { 753 | + // No more bytes available; read from the file into the buffer, 754 | + // setting nSrcPos to one beyond the end of the new data. 755 | + // Throws exception if end-of-file reached.
stickies-v commented at 5:08 PM on March 21, 2023:Given that it's part of the interface, I think this needs to be documented on the function level so devs wanting to use FindByte know how it behaves when the byte isn't found - they shouldn't need to dive into the implementation.
in src/streams.h:761 in 0fe832c4a4 outdated
741 | @@ -742,15 +742,25 @@ class CBufferedFile 742 | } 743 | 744 | //! search for a given byte in the stream, and remain positioned on it 745 | - void FindByte(uint8_t ch) 746 | + void FindByte(std::byte byte) 747 | { 748 | + // For best performance, avoid mod operation within the loop.
stickies-v commented at 5:12 PM on March 21, 2023:nit
// The modulus operation is much more expensive than byte // comparison and addition, so we keep it out of the loop to // improve performance (see [#19690](/bitcoin-bitcoin/19690/) for discussion).stickies-v approvedstickies-v commented at 1:38 PM on March 22, 2023: contributorACK 0fe832c4a
DrahtBot requested review from achow101 on Mar 22, 2023achow101 commented at 6:35 PM on April 21, 2023: memberACK 0fe832c4a4b2049fdf967bca375468d5ac285563
DrahtBot removed review request from achow101 on Apr 21, 2023john-moffett approvedjohn-moffett commented at 7:04 PM on April 21, 2023: contributorACK 0fe832c4a4b2049fdf967bca375468d5ac285563
achow101 commented at 10:45 PM on April 21, 2023: memberSilent merge conflict with master:
../../../src/bench/streams_findbyte.cpp:7:10: fatal error: fs.h: No such file or directory 7 | #include <fs.h> | ^~~~~~ compilation terminated.[bench] add streams findbyte 604df63f6c72efc26439util: improve streams.h:FindByte() performance
Avoid use of the expensive mod operator (%) when calculating the buffer offset. No functional difference. Co-authored-by: Hennadii Stepanov <32963518+hebasto@users.noreply.github.com>
LarryRuane force-pushed on May 5, 2023LarryRuane commented at 12:04 PM on May 5, 2023: contributorForce pushed rebase to fix hidden merge conflict, thanks @achow101
stickies-v approvedstickies-v commented at 1:16 PM on May 5, 2023: contributorre-ACK 72efc26439da9a1344a19569fb0cab01f82ae7d1
Verified that the only difference is to include
<util/fs.h>instead of<fs.h>(introduced by https://github.com/bitcoin/bitcoin/pull/27254/commits/00e9b97f37e0bdf4c647236838c10b68b7ad5be3)% git range-diff HEAD~2 0fe832c4a4b2049fdf967bca375468d5ac285563 HEAD 1: 5842d92c8 ! 1: 604df63f6 [bench] add streams findbyte @@ src/bench/streams_findbyte.cpp (new) + +#include <bench/bench.h> + -+#include <fs.h> ++#include <util/fs.h> +#include <streams.h> + +static void FindByte(benchmark::Bench& bench) 2: 0fe832c4a = 2: 72efc2643 util: improve streams.h:FindByte() performanceDrahtBot requested review from achow101 on May 5, 2023DrahtBot requested review from john-moffett on May 5, 2023achow101 commented at 9:40 PM on May 10, 2023: memberre-ACK 72efc26439da9a1344a19569fb0cab01f82ae7d1
DrahtBot removed review request from achow101 on May 10, 2023achow101 merged this on May 10, 2023achow101 closed this on May 10, 2023sidhujag referenced this in commit 7dc3f777e0 on May 11, 2023bitcoin locked this on May 9, 2024
github-metadata-mirror
This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 15:14 UTC
This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me