sdaftuar
commented at 8:35 pm on July 26, 2022:
member
New nodes starting up for the first time lack protection against DoS from low-difficulty headers. While checkpoints serve as our protection against headers that fork from the main chain below the known checkpointed values, this protection only applies to nodes that have been able to download the honest chain to the checkpointed heights.
We can protect all nodes from DoS from low-difficulty headers by adopting a different strategy: before we commit to storing a header in permanent storage, first verify that the header is part of a chain that has sufficiently high work (either nMinimumChainWork, or something comparable to our tip). This means that we will download headers from a given peer twice: once to verify the work on the chain, and a second time when permanently storing the headers.
The p2p protocol doesn’t provide an easy way for us to ensure that we receive the same headers during the second download of a peer’s headers chain. To ensure that a peer doesn’t (say) give us the main chain in phase 1 and then trick us into permanently storing an alternate, low-work chain in phase 2, we store commitments to the headers during our first download, which we validate during the second.
Some parameters must be chosen for the commitment size/frequency in phase 1, and for validation of commitments in phase 2. In this PR, those parameters are chosen to both (a) minimize the per-peer memory usage that an attacker could utilize, and (b) bound the expected amount of permanent memory that an attacker could get us to use well below the memory growth we’d get from the honest chain (where we expect one new block header every 10 minutes).
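To make the commitment scheme concrete, here is a minimal sketch of the two-phase bookkeeping (not the PR’s actual code: the sampling frequency matches the PR’s HEADER_COMMITMENT_FREQUENCY, but the salted hash and types are stand-ins for what the implementation uses):

#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <random>
#include <string>

using Hash = std::string; // stand-in for a block hash (uint256 in the real code)

class CommitmentTracker {
    uint64_t m_salt;                // per-sync salt, so bits can't be precomputed
    std::deque<bool> m_commitments; // ~1 bit per FREQUENCY headers downloaded
    static constexpr size_t FREQUENCY{571};

    bool CommitmentBit(const Hash& block_hash) const {
        // Stand-in for the salted hasher; the PR uses a keyed hash of the block hash.
        return (std::hash<Hash>{}(block_hash) ^ m_salt) & 1;
    }

public:
    CommitmentTracker() : m_salt{std::random_device{}()} {}

    // Phase 1: while measuring the chain's work, remember one bit per FREQUENCY headers.
    void RecordPhase1(size_t height, const Hash& block_hash) {
        if (height % FREQUENCY == 0) m_commitments.push_back(CommitmentBit(block_hash));
    }

    // Phase 2: the redownloaded chain must reproduce every remembered bit, in
    // order; a mismatch means the peer switched chains between the two phases.
    bool CheckPhase2(size_t height, const Hash& block_hash) {
        if (height % FREQUENCY != 0) return true;
        if (m_commitments.empty()) return false;
        const bool expected{m_commitments.front()};
        m_commitments.pop_front();
        return expected == CommitmentBit(block_hash);
    }
};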
After this PR, we should be able to remove checkpoints from our code, which is a nice philosophical change for us to make as well, as there has been confusion over the years about the role checkpoints play in Bitcoin’s consensus algorithm.
Thanks to Pieter Wuille for collaborating on this design.
DrahtBot added the label
P2P
on Jul 26, 2022
sdaftuar force-pushed
on Jul 26, 2022
jamesob
commented at 11:32 pm on July 26, 2022:
member
Concept ACK, will review soon
lish2099 approved
DrahtBot
commented at 6:03 am on July 27, 2022:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Conflicts
Reviewers, this pull request conflicts with the following ones:
#25923 (p2p, bugfix: always provide CNodeStateStats, ensure v23 getpeerinfo API for v24 by jonatack)
#25725 (consensus: Remove mainnet checkpoints by sdaftuar)
#25673 (refactor: make member functions const when applicable by aureleoules)
#25555 (refactor: Move m_num_unconnecting_headers_msgs out of cs_main guard by MarcoFalke)
#24571 (p2p: Prevent block index fingerprinting by sending additional getheaders messages by dergoegge)
#23443 (p2p: Erlay support signaling by naumenkogs)
#23352 (test: Extend stale_tip_peer_management test by MarcoFalke)
#18933 (rpc: Add submit option to generateblock by MarcoFalke)
#18470 (test: Extend stale tip test by MarcoFalke)
#17860 (fuzz: BIP 42, BIP 30, CVE-2018-17144 by MarcoFalke)
#16981 (Improve runtime performance of -reindex by LarryRuane)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
glozow requested review from dergoegge
on Jul 27, 2022
glozow requested review from mzumsande
on Jul 27, 2022
glozow requested review from ajtowns
on Jul 27, 2022
sdaftuar force-pushed
on Jul 27, 2022
sdaftuar force-pushed
on Jul 27, 2022
sdaftuar force-pushed
on Jul 27, 2022
sdaftuar force-pushed
on Jul 27, 2022
sdaftuar force-pushed
on Jul 27, 2022
sdaftuar
commented at 3:59 pm on July 27, 2022:
member
It’s worth pointing out that initial headers sync will be slowed down by this PR, and not just because we download headers twice before storing them. The added latency before we make progress with our initial headers-sync peer also makes it more likely that a block announcement arrives while headers sync is still in progress, which in turn triggers a headers sync with every peer that announces the block. (And our starting point for those syncs will be further behind, because we don’t make progress on adding headers to our block index until phase 2.)
I opened #25720 as a mitigation for this effect; many strategies are possible and I think they are orthogonal to the change proposed here, so I’d prefer to leave the discussion of this particular issue to that PR.
luke-jr
commented at 7:53 pm on July 27, 2022:
member
This seems to be based on the assumption that the DoS is attacking disk space, but bandwidth tends to be more limited than space, and it makes that worse even in the best scenario…?
sipa
commented at 8:24 pm on July 27, 2022:
member
@luke-jr The DoS concern is primarily about memory usage: filling mapBlockIndex and other data structures with low-difficulty headers before a checkpoint is reached.
Bandwidth is a concern for sure, but:
There are many more effective ways to perform bandwidth DoS (like spamming INV messages, ADDR messages, PING messages, continuously requesting non-existing transactions or blocks, …).
The proper solution to bandwidth DoS is just throttling peers that waste too much of it.
In non-attack scenarios, the added bandwidth by this PR (especially when combined with #25720) is in the order of 10s of MB over a node’s lifetime, which is negligible compared to the full block download.
luke-jr
commented at 8:33 pm on July 27, 2022:
member
Is there a reason to not just prune the block index under some conditions?
sipa
commented at 9:14 pm on July 27, 2022:
member
It may make sense to include the script in the repository (as I can imagine it being re-run to tune things occasionally, but there should not be a need to do it more than every few years). It could also live on the devwiki or just linked-to in the code. Opinions?
sipa
commented at 10:29 pm on July 27, 2022:
member
@luke-jr I think that’s a potentially useful but orthogonal improvement, though there is probably much less need for it after this PR.
215+ if ((current_height - m_chain_start->nHeight) % HEADER_COMMITMENT_FREQUENCY == 0) {
216+ // Try to add a commitment.
217+ m_header_commitments.push_back(m_hasher(current.GetHash()) & 1);
218+ if (m_header_commitments.size() > m_max_commitments) {
219+ // The peer's chain is too long; give up.
220+ // TODO: disconnect this peer.
Why not do that in this PR? Maybe add a new state FAILED (or an extra bool on HeadersSyncState) that signals to net_processing that the peer should be disconnected? That seems simple enough and in obvious cases like this one (i.e. too many commitments or e.g. invalid PoW) the peer definitely behaved funky and should be punished.
Now that I think about it, it’s possible that if the peer’s chain has grown (and the sync takes very long), you could get to this condition; the right thing to do in that scenario (assuming our peer actually has a more-work chain) is to try to sync with this peer again later. (Obviously this is a pathological case, but I think we should still be able to sync such a chain eventually.)
Fixed the braces. Should I also fix all the variable names? (Saw @instagibbs’ comment along those lines just now, so if you agree then I’ll modernize the whole thing.)
0@@ -0,0 +1,318 @@
1+#include <headerssync.h>
2+#include <logging.h>
3+#include <util/check.h>
4+#include <pow.h>
5+#include <timedata.h>
6+
7+// Store a commitment to a header every HEADER_COMMITMENT_FREQUENCY blocks
Nit in “Utilize anti-DoS headers download strategy”:
Perhaps use doxygen comments for comments on a function/variable definition or declaration? (either /** ... */ or //! ...). That way they get picked up by doxygen for generating documentation.
19+HeadersSyncState::HeadersSyncState(NodeId id, const Consensus::Params& consensus_params) :
20+ m_id(id), m_consensus_params(consensus_params)
21+{ }
22+
23+// Free any memory in use, and mark this object as no longer usable.
24+void HeadersSyncState::Finalize()
56+ // chain using 6 blocks/second (fastest blockrate given the MTP rule) times
57+ // the number of seconds from the last allowed block until today. This
58+ // serves as a memory bound on how many commitments we might store from
59+ // this peer, and we can safely give up syncing if the peer exceeds this
60+ // bound, because it's not possible for a consensus-valid chain to be
61+ // longer than this.
In commit “Utilize anti-DoS headers download strategy”
Arguably, it’s not exactly true that no consensus-valid chain can be longer, but it’s not possible for the peer to have such a chain at the time the sync starts (it’d need a tip timestamp that is in the future at the time the sync starts).
In commit “Utilize anti-DoS headers download strategy”
Any reason to do this here, and not in StartInitialDownload? It seems a bit cleaner to set these values as soon as they’re known, rather than on the fly.
42+ const std::vector<CBlockHeader>& initial_headers, const arith_uint256&
43+ minimum_required_work, CBlockLocator&& chain_start_locator)
44+{
45+ // A new instance of this object should be instantiated for every headers
46+ // sync, so that we don't reuse our salted hasher between syncs.
47+ assert(m_download_state == State::INITIAL_DOWNLOAD);
174+ // Doesn't connect to what we expect
175+ Finalize();
176+ return false;
177+ }
178+
179+ // If it does connect, (minimally) validate and store a commitment to each one.
in test/functional/p2p_headers_sync_with_minchainwork.py:30 (commit 7bed25cbd3, outdated)
25+ def test_chains_sync_when_long_enough(self):
26+ self.log.info("Generate blocks on the node with no required chainwork")
27+ self.generate(self.nodes[0], NODE1_BLOCKS_REQUIRED-2, sync_fun=self.no_op)
28+
29+ self.log.info("Verify nodes 1 and 2 have no new headers in their headers tree")
30+ time.sleep(2)
Hm, will assert_debug_log work properly here if there might be multiple “[net] Ignoring low-work chain” lines in our debug.log? (Perhaps that would work if I generated one block at a time in a loop, rather than invoke generate once with multiple blocks?)
in test/functional/p2p_headers_sync_with_minchainwork.py:29 (commit 7bed25cbd3, outdated)
24+
25+ def test_chains_sync_when_long_enough(self):
26+ self.log.info("Generate blocks on the node with no required chainwork")
27+ self.generate(self.nodes[0], NODE1_BLOCKS_REQUIRED-2, sync_fun=self.no_op)
28+
29+ self.log.info("Verify nodes 1 and 2 have no new headers in their headers tree")
+1, it looks like node2 currently doesn’t receive any headers because it’s not connected to node0?
IIRC the default is that 0-1 and 1-2 are connected. My suggestion would be to override setup_network so that node0 is connected directly to both node1 and node2.
in test/functional/p2p_headers_sync_with_minchainwork.py:46 (commit 7bed25cbd3, outdated)
42+ self.log.info("Verify that node1 syncs node0's chain")
43+ time.sleep(2)
44+ self.sync_blocks(self.nodes[0:2])
45+
46+ self.log.info("Verify that node2 still has no new headers")
47+ time.sleep(2)
In commit “Utilize anti-DoS headers download strategy”
Just a thought for later, but it’s rather ugly to have to construct a CBlockIndex object just to be able to call GetBlockProof. I think GetBlockProof should work (or have a variant that works) with a CBlockHeader too, or even just the nBits value.
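For illustration, a variant taking just nBits could mirror the arithmetic of GetBlockProof in chain.cpp (a sketch, not something the PR adds):

#include <arith_uint256.h>

// Hypothetical nBits-only variant of GetBlockProof.
arith_uint256 GetBlockProofFromNBits(uint32_t nBits)
{
    arith_uint256 target;
    bool negative, overflow;
    target.SetCompact(nBits, &negative, &overflow);
    if (negative || overflow || target == 0) return 0;
    // Work is floor(2^256 / (target + 1)); since 2^256 doesn't fit in 256
    // bits, compute the equivalent ~target / (target + 1) + 1.
    return (~target / (target + 1)) + 1;
}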
in src/headerssync.cpp:241 (commit de72d0867d, outdated)
236+ if (m_download_state == State::FINAL) return false;
237+
238+ // Ensure that we're working on a header that connects to the chain we're
239+ // downloading.
240+ if (m_redownloaded_headers.empty()) {
241+ if (header.hashPrevBlock != m_chain_start->GetBlockHash()) {
In commit “Utilize anti-DoS headers download strategy”
Is it possible to set this m_redownload_buffer_last_hash = m_chain_start->GetBlockHash(); m_redownload_buffer_last_height = m_chain_start->nHeight; in ValidateAndStoreHeadersCommitments already when the transition to REDOWNLOAD state is made?
267+ // Store this header for later processing.
268+ m_redownloaded_headers.push_back(header);
269+ m_redownload_buffer_last_height = next_height;
270+ m_redownload_buffer_last_hash = header.GetHash();
271+
272+ // If we're processing our target block header, which we verified has
In commit “Utilize anti-DoS headers download strategy”
Could there be a concern that the last few headers get reorganized during the second phase, resulting in a mismatch at the end? If so, perhaps it’s possible to instead keep track of chainwork again in the second phase, rather than remembering the exact hash at which the threshold was reached in the first phase?
I added a commit that tries to improve this. Left it unsquashed for now so that it’s easier to review; I don’t believe there are any material DoS risks introduced by this change but just want to make sure we get enough eyes on this logic.
dergoegge
commented at 2:41 pm on July 28, 2022:
member
Concept & Approach ACK
Left some comments (mostly about the added functional test)
135+ * can process and validate now (because these returned
136+ * headers are on a chain with sufficient work)
137+ * processing_success: set to false if an error is detected and the sync is
138+ * aborted; true otherwise.
139+ */
140+ std::optional<CBlockLocator> ProcessNextHeaders(const std::vector<CBlockHeader>& headers,
In commit “Utilize anti-DoS headers download strategy”
This function really returns 3 things: an optional block locator, a vector of headers to process, and a bool processing_success. Having those spread over return values and mutable arguments is a bit ugly.
How about returning a typedef’d std::optional<std::pair<std::optional<CBlockLocator>, std::vector<CBlockHeader>>>, or making a simple custom struct to return the results in?
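A rough sketch of the custom-struct option (the PR later settled on a ProcessingResult along these lines, per the doc comment quoted further below; the full_headers_message parameter is assumed):

#include <optional>
#include <vector>

// Sketch only; CBlockHeader and CBlockLocator are the existing Core types.
struct ProcessingResult {
    std::vector<CBlockHeader> pow_validated_headers; //!< headers the caller can accept now
    std::optional<CBlockLocator> locator;            //!< next getheaders locator, if any
    bool success{false};                             //!< false if the sync was aborted
    bool request_more{false};                        //!< caller should request more headers
};

ProcessingResult ProcessNextHeaders(const std::vector<CBlockHeader>& received_headers,
                                    bool full_headers_message);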
I’m not sure there’s a great way to do this without essentially duplicating the code from pow.cpp; is that what you have in mind? (That might be a reasonable suggestion, I’m just not sure what makes the most sense.)
instagibbs
commented at 12:59 pm on July 29, 2022:
Not convinced either way, maybe just comment on the relationship here?
Well, it could be, but the test would still be correct if pindexLast.nBits were modified to be within a factor of 4 from the max difficulty target.
So it seems to me that having expected_nbits = 0x1d00ffffU is the more important line as it captures exactly what the test case is trying to exercise here (and pindexLast.nBits could be set from that, or not).
in src/test/pow_tests.cpp:73 (commit cbf3069c7d, outdated)
@instagibbs In the old Satoshi-era naming convention, it referred to “bignum” (there was a CBigNum wrapper around the OpenSSL BIGNUM type, which was used for various big integer operations, both inside script and for PoW purposes).
For the purposes of what is needed for the security analysis of the overall header commitment scheme, the only requirements are that this verifies (a) that the difficulty doesn’t change on non-2016-multiple blocks and (b) doesn’t go up or down too quickly. Its goal is forcing the attacker to spread out their attempted PoW over many blocks, rather than just one or a few (because creating N blocks with each difficulty D is much harder than creating one block with difficulty N*D, if the hashpower available to the attacker is less than the expected value for an N*D difficulty block).
No objection to checking whatever can be checked with the provided arguments, but I think the current code just chooses to check exactly what is needed, and document it, rather than verify everything possible.
I think I prefer to just not check anything for chains where proof-of-work isn’t an anti-DoS mechanism for us; seems like it avoids a false sense of security (and it doesn’t accomplish anything anyway, as an attacker would just avoid failing such a check without expending any real resources).
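A rough sketch of the two checks described above (ignoring compact-encoding rounding; not the PR’s exact code):

#include <arith_uint256.h>
#include <consensus/params.h>

bool PermittedDifficultyTransitionSketch(const Consensus::Params& params, int64_t height,
                                         uint32_t old_nbits, uint32_t new_nbits)
{
    // No check on networks where min-difficulty blocks are allowed, since
    // PoW is not an anti-DoS mechanism for us there.
    if (params.fPowAllowMinDifficultyBlocks) return true;

    if (height % params.DifficultyAdjustmentInterval() == 0) {
        // (b) At a retarget boundary, the target may move by at most a
        // factor of 4 in either direction (the consensus retargeting limit).
        arith_uint256 old_target, new_target;
        old_target.SetCompact(old_nbits);
        new_target.SetCompact(new_nbits);
        if ((new_target >> 2) > old_target) return false; // eased more than 4x
        if ((old_target >> 2) > new_target) return false; // tightened more than 4x
        return true;
    }
    // (a) Off-boundary blocks must not change the difficulty at all.
    return old_nbits == new_nbits;
}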
sdaftuar force-pushed
on Jul 28, 2022
in test/functional/p2p_headers_sync_with_minchainwork.py:23 (commit ec2723b39f, outdated)
18+ self.num_nodes = 3
19+ # Node0 has no required chainwork; node1 requires 15 blocks on top of the genesis block; node2 requires 2047
20+ self.extra_args = [["-minimumchainwork=0x0"], ["-minimumchainwork=0x1f"], ["-minimumchainwork=0x1000"]]
21+
22+ def skip_test_if_missing_module(self):
23+ self.skip_if_no_wallet()
in test/functional/p2p_headers_sync_with_minchainwork.py:53 (commit ec2723b39f, outdated)
26+ self.log.info("Generate blocks on the node with no required chainwork")
27+ self.generate(self.nodes[0], NODE1_BLOCKS_REQUIRED-2, sync_fun=self.no_op)
28+
29+ self.log.info("Verify nodes 1 and 2 have no new headers in their headers tree")
30+ time.sleep(2)
31+ for node in self.nodes[1:]:
I agree with removing the sleep() since it’s not a very reliable way to wait for things to happen in the functional tests (which are often run in parallel). Not sure about relying on assert_debug_log.
I’d recommend wait_until the headers are received, for example:
0 self.log.info("Generate blocks on the node with no required chainwork")
1 node1_recv_headers = self.nodes[1].getpeerinfo()[0]["bytesrecv_per_msg"]["headers"]
2 node2_recv_headers = self.nodes[2].getpeerinfo()[0]["bytesrecv_per_msg"]["headers"]
3 self.generate(self.nodes[0], NODE1_BLOCKS_REQUIRED-2, sync_fun=self.no_op)
4
5 self.log.info("Verify nodes 1 and 2 have no new headers in their headers tree")
6 self.wait_until(lambda: self.nodes[1].getpeerinfo()[0]["bytesrecv_per_msg"]["headers"] > node1_recv_headers)
7 self.wait_until(lambda: self.nodes[2].getpeerinfo()[0]["bytesrecv_per_msg"]["headers"] > node2_recv_headers)
8 for node in self.nodes[1:]:
(this is using getpeerinfo()[0] assuming node1 and node2 have node0 as their first peer)
Ah, that’s clever, but I think the problem is that the number of headers messages that come in after a call to generate is variable, based on the particular blocks that get INV’ed and the responses they generate.
I took an approach of changing the logging to include the height of the chain, and then used the assert_debug_log approach to check for it. This gets rid of all the sleeps and I think ought to work?
in src/test/fuzz/bitdeque.cpp:19 (commit 5072054428, outdated)
14+namespace {
15+
16+constexpr int LEN_BITS = 16;
17+constexpr int RANDDATA_BITS = 20;
18+
19+using bitdeque_type = bitdeque<128>;
@glozow The testing power is significantly less if you need 16384 booleans before another allocation is made (because interactions that involve many separate allocations are much more likely to trigger bugs).
101+ // sending a new getheaders immediately could trigger an infinite
102+ // loop. Just give up for now; if our peer ever gives us an block
103+ // INV later we will fetch headers then, and likely retrigger this
104+ // logic.
105+ Finalize();
106+ return ret;
There are lots of return ret; statements in this function now. Perhaps it’s cleaner to restructure it so that there is only a single return statement at the end, and all the branches just modify the values in ret?
5+#include <util/check.h>
6+
7+//! Store a commitment to a header every HEADER_COMMITMENT_FREQUENCY blocks
8+constexpr size_t HEADER_COMMITMENT_FREQUENCY = 571;
9+//! Pull out headers for acceptance during redownload once our buffer exceeds this amount
10+constexpr size_t REDOWNLOAD_HEADERS_THRESHOLD = 25*HEADER_COMMITMENT_FREQUENCY;
Perhaps 25 should be named, e.g. REDOWNLOAD_BUFFER_THRESHOLD? IIUC it’s possible to change this in the future if the optimization script outputs a different rsize?
0//! Store a commitment to a header every HEADER_COMMITMENT_FREQUENCY blocks
1constexpr size_t HEADER_COMMITMENT_FREQUENCY{571};
2//! Pull out headers for acceptance during redownload once our buffer exceeds this number of headers
3constexpr size_t REDOWNLOAD_HEADERS_THRESHOLD{25*HEADER_COMMITMENT_FREQUENCY};
@glozow REDOWNLOAD_HEADERS_THRESHOLD will (in optimal configurations) always be a multiple of HEADER_COMMITMENT_FREQUENCY. It could have been written as just = 14275;, but that would make this multiple-of relation less clear.
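For example, the relation could be made explicit with a named factor (hypothetical name):

//! Store a commitment to a header every HEADER_COMMITMENT_FREQUENCY blocks
constexpr size_t HEADER_COMMITMENT_FREQUENCY{571};
//! Size of the redownload buffer, in commitment periods (hypothetical name)
constexpr size_t REDOWNLOAD_BUFFER_PERIODS{25};
//! Pull out headers for acceptance during redownload once our buffer exceeds this amount
constexpr size_t REDOWNLOAD_HEADERS_THRESHOLD{REDOWNLOAD_BUFFER_PERIODS * HEADER_COMMITMENT_FREQUENCY};
static_assert(REDOWNLOAD_HEADERS_THRESHOLD % HEADER_COMMITMENT_FREQUENCY == 0);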
28+ * This function only checks that the new value is within a factor of 4 of the
29+ * old value for blocks at the difficulty adjustment interval, and otherwise
30+ * requires the values to be the same.
31+ *
32+ * Always returns true on networks where min difficulty blocks are allowed,
33+ * such as regtest/testnet.
@glozow No, on non-difficulty-adjustment blocks it returns whether the difficulty is unchanged.
glozow
commented at 4:19 pm on August 1, 2022:
member
Approach ACK, still reviewing the code. Ran this on mainnet a few times as a sanity check, synced fine. Ran the fuzzer for a while.
It may make sense to include the script in the repository (as I can imagine it being re-run to tune things occasionally, but there should not be a need to do it more than every few years). It could also live on the devwiki or just linked-to in the code. Opinions?
Maybe contrib/devtools? Devwiki seems fine too, no strong opinion.
I have some questions trying to understand headerssync_params.py. Why is BLOCK_INTERVAL = timedelta(seconds=598, microseconds=800000) rather than exactly 10min (link)? Is it for expected hashrate increase? And based on “especially as this attack is only possible before the victim has learned about the honest chain” (link) does this mean we should always attempt to sync headers from all/multiple outbound peers at once?
sipa
commented at 9:56 pm on August 1, 2022:
member
It may make sense to include the script in the repository (as I can imagine it being re-run to tune things occasionally, but there should not be a need to do it more than every few years). It could also live on the devwiki or just linked-to in the code. Opinions?
Maybe contrib/devtools? Devwiki seems fine too, no strong opinion.
My current thinking is to add it to contrib/devtools in a follow-up PR, together with instructions in the release-process to run/update it. No need to add the review burden of that here, as the current values are likely more than sufficient for at least a few years.
I have some questions trying to understand headerssync_params.py.
Why is BLOCK_INTERVAL = timedelta(seconds=598, microseconds=800000) rather than exactly 10min (link)? Is it for expected hashrate increase?
Yes, the number is just the average block interval in 2021, approximately. It barely matters, so I’ve changed it to just 600 seconds. A few percent off on this value isn’t going to change the result.
And based on “especially as this attack is only possible before the victim has learned about the honest chain” (link) does this mean we should always attempt to sync headers from all/multiple outbound peers at once?
I don’t think so; that also worsens the attack potential in addition to reducing it, because it increases the chance that at least one of the nodes synced from will be an attacker, and gives them a window while the first (eventually successful) sync hasn’t reached the second stage yet.
And syncing from all peers at once is a pretty extreme position to take from a bandwidth optimization perspective (in non-attack scenarios). For most aspects of the P2P protocol, we attempt to never request the same thing twice simultaneously (this is true for transactions and blocks, except the high-bandwidth compact block mode which makes a bounded number of exceptions). Headers are small, so strictly requiring only a single header sync in flight is pretty extreme and opens up the ability for peers to stall the sync for a long time, but fetching from all at once means wasting possibly several GB of volume. #25720 picks something in between: start one sync for each new block announcement.
glozow
commented at 4:03 pm on August 2, 2022:
member
mzumsande
commented at 10:17 pm on August 2, 2022:
It might save a few operations to first check if the new target is larger or smaller than the old one, and conditional on that run only one of the two checks.
I made an attempt at this suggestion to see if the code would look any simpler, but it just seems to make it slightly more complex to think about (and not any shorter, at least the way I wrote it, which is mimicking the code in CalculateNextWorkRequired()).
At any rate, I also think the performance effect should be very minor, since we only do this once every 2000 headers (and nothing here should be all that slow to begin with, I think?).
I’ll leave this as-is, and if there’s some reason to make further improvements here, perhaps someone can pick it up in a followup PR.
in src/net_processing.cpp:5074 (commit c64d02a21d, outdated)
5071@@ -5076,6 +5072,24 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
5072
5073 MaybeSendAddr(*pto, *peer, current_time);
5074
5075+ // Delay sending SENDHEADERS (BIP 130) until we're done with an
5076+ // initial-headers-sync with this peer. Receiving headers announcements for
mzumsande
commented at 10:45 pm on August 2, 2022:
Would we actually receive headers messages during initial-headers-sync without this commit? As I wrote here, I got the impression that other peers revert to Inv when they found a new block during headers sync and won’t send us unrequested headers messages anyway, so I wonder if delaying SENDHEADERS actually has an effect.
I believe it is possible, for 2 main reasons: (1) BIP 130 has no explicit rules around when headers may be used to announce a block, so in theory any software implementing BIP 130 might be free to send a header to us regardless of whether we’ve completed header sync; (2) Bitcoin Core implementations would send us announcements via header at the end of our phase 1 download (if we reach their tip at the end of that phase, which is certainly possible). So I think delaying sendheaders helps eliminate the common cases of how we might get a spurious headers message during our sync.
in src/net_processing.cpp:2826 (commit 6335a03b9e, outdated)
2682@@ -2483,25 +2683,31 @@ void PeerManagerImpl::ProcessHeadersMessage(CNode& pfrom, Peer& peer,
2683 }
2684
2685 // At this point, the headers connect to something in our block index.
2686- if (!CheckHeadersAreContinuous(headers)) {
2687- Misbehaving(peer, 20, "non-continuous headers sequence");
2688+ // Do anti-DoS checks to determine if we should process or store for later
2689+ // processing.
2690+ if (!already_validated_work && TryLowWorkHeadersSync(peer, pfrom,
I think it would be good to have some unconditional logging during the first phase, because the node is currently quiet for possibly minutes after startup, which might lead users to think the node is stuck. Currently, the “Synchronizing blockheaders, height: (…) (~(…)%)” messages only start in the redownload phase. We should probably not log too much to prevent possible disk-filling attacks caused by a low-work header chain, but at least announcing the start of each phase would be nice.
So if I add unconditional logging on starting a sync, then in the absence of a rate limiter I think it’d be easy for someone to fill up our disk by starting and restarting syncs with us. Perhaps I can log at most once per peer connection when we start a low-work headers sync, what do you think?
instagibbs
commented at 4:27 pm on August 3, 2022:
maybe once per connection per X seconds, which would give consistent feedback that’s rate-limited?
Ideally, I’d say, we produce header sync progress updates based on the longest currently-ongoing header sync across all peers. That sounds like a pain to implement, though.
An additional complication in logging is that the first phase takes place entirely in net_processing, in per-peer datastructures. There is no interaction with validation, which currently emits the header sync progress updates, until the second phase.
Maybe once per connection, and excluding inbound peers would be safe? Alternatively, we could log something once on startup if our best chain is below nMinimumChainWork. I’m thinking of the typical case of a new node without any headers starting up.
in src/headerssync.h:95 (commit 6335a03b9e, outdated)
90+ * that we see. With a reasonable choice of N, this uses relatively little
91+ * memory even for a very long chain.
92+ *
93+ * - In phase 2 (redownload), keep a lookahead buffer of size H, and only
94+ * accept a batch of N (N < H) headers to memory once we've verified that H/N =
95+ * S commitments have all passed verification. With this parametrization, we
I can’t really relate this comment to the implementation. From what I understand we accept all received headers in phase 2 to memory (m_redownloaded_headers), but only accept them to the block index once there are REDOWNLOAD_HEADERS_CHECK_BITS correct commitments on top of them - with the batch size N of headers that are being accepted to the block index being just an (unimportant) consequence of the batch size in which we happen to receive headers from peers (2000 once the lookahead buffer is filled)?
mzumsande
commented at 5:03 am on August 3, 2022:
contributor
Concept ACK
Not finished with the review yet, but I left some comments I had so far.
(With the latest push, feature_block.py and rpc_blockchain.py currently fail)
sdaftuar force-pushed
on Aug 3, 2022
sdaftuar force-pushed
on Aug 3, 2022
in src/pow.cpp:76 (commit 68c05de96a, outdated)
70@@ -71,6 +71,57 @@ unsigned int CalculateNextWorkRequired(const CBlockIndex* pindexLast, int64_t nF
71 return bnNew.GetCompact();
72 }
7374+// Check that on difficulty adjustments, the new difficulty does not increase
75+// or decrease beyond the permitted limits.
76+bool PermittedDifficultyTransition(const Consensus::Params& params, int64_t height, uint32_t old_nbits, uint32_t new_nbits)
Would it be possible to also add PermittedDifficultyTransition() coverage to the fuzz test src/test/fuzz/pow.cpp? I’m thinking of something like combining it with the new target from GetNextWorkRequired(), and then asserting that the resulting transition passes PermittedDifficultyTransition().
64+ // today. This serves as a memory bound on how many commitments we might
65+ // store from this peer, and we can safely give up syncing if the peer
66+ // exceeds this bound, because it's not possible for a consensus-valid
67+ // chain to be longer than this (at the current time -- in the future we
68+ // could try again, if necessary, to sync a longer chain).
69+ m_max_commitments = 6*(GetAdjustedTime() - chain_start->GetMedianTimePast() + MAX_FUTURE_BLOCK_TIME) / HEADER_COMMITMENT_FREQUENCY;
5691c174c5cb2f9947e6cc2db8d32b4b4715e47d: note to other reviewers: when using this code to sync from the genesis block in the year 2100, m_max_commitments would be about 32 million. That’s less than 4 MB per peer, assuming bitdeque has low overhead.
Of course by that time the baked in value of nMinimumChainWork might be too low.
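The arithmetic behind that estimate, roughly (assumptions: sync starts at the 2009 genesis block, one commitment bit per 571 headers):

// Back-of-envelope only; all constants are approximations.
constexpr double SECONDS_TO_2100{91 * 365.25 * 24 * 3600}; // 2009 -> 2100, ~2.87e9 s
constexpr double MAX_HEADERS{6 * SECONDS_TO_2100};         // 6 blocks/s bound, ~1.7e10
constexpr double MAX_COMMITMENTS{MAX_HEADERS / 571};       // ~3.0e7 (order of the ~32M above)
constexpr double MEMORY_BYTES{MAX_COMMITMENTS / 8};        // 1 bit each, ~3.8 MB per peer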
in src/headerssync.cpp:8 (commit 5691c174c5, outdated)
0@@ -0,0 +1,318 @@
1+#include <headerssync.h>
2+#include <logging.h>
3+#include <pow.h>
4+#include <timedata.h>
5+#include <util/check.h>
6+
7+//! Store a commitment to a header every HEADER_COMMITMENT_FREQUENCY blocks
8+constexpr size_t HEADER_COMMITMENT_FREQUENCY{571};
105+ return ret;
106+ } else {
107+ // If we're in INITIAL_DOWNLOAD and we get a non-full headers
108+ // message, then the peer's chain has ended and definitely doesn't
109+ // have enough work, so we can stop our sync.
110+ LogPrint(BCLog::NET, "Ignoring low-work chain (height=%u) from peer=%d\n", m_current_height, m_id);
Just to make sure I understand your question – I think you’re wondering why this log message appears here as well as in net_processing.cpp at line 2473?
The issue is that in net_processing we won’t bother even starting a headers sync using the logic in headerssync.cpp if we receive a headers message that (a) isn’t full and (b) which connects to something we know in our block index and (c) has too little work, because we know in this case that the headers don’t lead to a high work chain (if it did, the headers message would have been full, indicating the peer has more to give us).
However, in the case that we start a headers sync using this module, then net_processing will be receiving headers that don’t connect to the block index, and those messages can only be properly processed by this logic which keeps track of where we expect the headers to connect. So in this logic we also are checking whether the sync has to end because our peer has no more headers to give us (ie the headers message we processed from them was not full, and the chain has too little work).
The other confusing bit is the comment “and definitely doesn’t have enough work”. When looking at this function in isolation, it’s not obvious why this would be the case, since we don’t even check the work. It’s also not obvious if I look at where this is called from.
5691c174c5cb2f9947e6cc2db8d32b4b4715e47d: can you add the height to this message? That, plus a message when the state changes, would make it easier to follow what’s happening in the -debug=net log.
Sjors
commented at 4:16 pm on August 5, 2022:
member
Very cool stuff. Some initial comments. Review test hint: -debug=net -stopatheight=100
Update 2022-08-06: why do we need bitdeque instead of just std::vector<bool>? Is it only because we call m_header_commitments.pop_front()? Couldn’t we also just keep the commitments, track our index with an integer and then delete the whole thing when we’re done?
in test/functional/p2p_dos_header_tree.py:66 (commit 5691c174c5, outdated)
62@@ -62,7 +63,7 @@ def run_test(self):
63
64 self.log.info("Feed all fork headers (succeeds without checkpoint)")
65 # On node 0 it succeeds because checkpoints are disabled
66- self.restart_node(0, extra_args=['-nocheckpoints'])
67+ self.restart_node(0, extra_args=['-nocheckpoints', "-minimumchainwork=0x0"])
5691c174c5cb2f9947e6cc2db8d32b4b4715e47d: maybe add additional tests where you set minimumchainwork to the chainwork at the original checkpoint (0x0000000000000000000000000000000000000000000000000000022302230223).
0
1 self.log.info("Feed all fork headers (fails due to minimum chain work)")
2 # Disable checkpoint, but require its chainwork as the minimum.
3 self.restart_node(0, extra_args=['-nocheckpoints', "-minimumchainwork=0x0000000000000000000000000000000000000000000000000000022302230223"])
4 peer_minimumchainwork = self.nodes[0].add_p2p_connection(P2PInterface())
5 peer_minimumchainwork.send_and_ping(msg_headers(self.headers_fork))
6 with self.nodes[0].assert_debug_log(['Ignoring low-work chain']):
7     peer_minimumchainwork.send_message(msg_headers(self.headers_fork))
8 # We don't disconnect
Note that if you also set the minimum chainwork for node 0 for the entire test, it will stall at “Feed all fork headers (fails due to checkpoint)”, because we no longer disconnect.
Not sure if the above covers more ground than the new test file introduced in 7be81ce6257b1940d633b1311432c09fbbe09a0b, but it might still be useful to illustrate the behavior change. Then again, in the real world our minimum chainwork is much higher than that of the most recent checkpoints, so this example may be more confusing than illustrating?
Do we need to rethink our headers sync timeout and/or disconnect logic? What happens when the first peer we try to do a headers sync with sends a low work chain? Since we don’t disconnect, is there another mechanism that causes us to try with another peer, or do we wait for a timeout or new block announcement? (trying to figure this out from just reading net_processing.cpp is daunting…)
But still, if the first peer gives us low PoW headers - and we don’t disconnect from it as we do now when it doesn’t contain a checkpoint - that would cause us to do nothing for quite a while. At least IIUC.
Actually new blocks, once every 10 minutes. That’s the current protection in practice against peers stalling headers sync forever by being very slow - once a new block is found on the network, we effectively start headers sync with all peers that announce it to us (unintentionally). #25720 changes it to just one new peer every block.
So currently peers can stall us by sending valid headers slowly (ensuring the first checkpoint is not reached within ~10 minutes). After this PR they can stall us regardless of how fast the headers are sent. But just sending headers slowly is simple enough to implement, so I suppose this PR does not make that problem worse.
So currently peers can stall us by sending valid headers slowly (ensuring the first checkpoint is not reached within ~10 minutes). After this PR they can stall us regardless of how fast the headers are sent. But just sending headers slowly is simple enough to implement, so I suppose this PR does not make that problem worse.
This is not my understanding. Peers that we initially choose for headers sync currently have a time window of ~20 minutes (15min HEADERS_DOWNLOAD_TIMEOUT_BASE + a component based on the expected number of headers). If we don’t have an up-to-date best header by then (no matter if we even received that from them or another peer), then they get disconnected. This logic is unaffected by this PR, although peers now have to serve the headers chain twice within these 20 minutes, so they need to be twice as fast.
Maybe it could make sense to add some criteria based on actual progress to this - if a peer has failed to serve us a single header for 10 minutes, does it really make sense to give them another 10 minutes?
The fact that we may also pick additional peers to sync headers from whenever a new block is found (10 minutes on average, but obviously very fluctuating) is independent from that, and also independent from the behavior of a potentially stalling peer. It could sometimes happen a few seconds after the headerssync start if a lucky block is mined then.
That’s still different from the case where a peer feeds us a set of headers that doesn’t include a checkpoint. In that case we immediately disconnect and move on the next peer. That behavior goes away with this PR (though it’s probably not a big deal).
Do we need to rethink our headers sync timeout and/or disconnect logic? What happens when the first peer we try to do a headers sync with sends a low work chain? Since we don’t disconnect, is there another mechanism that causes us to try with another peer, or do we wait for a timeout or new block announcement? (trying to figure this out from just reading net_processing.cpp is daunting…)
So currently peers can stall us by sending valid headers slowly (ensuring the first checkpoint is not reached within ~10 minutes). After this PR they can stall us regardless of how fast the headers are sent. But just sending headers slowly is simple enough to implement, so I suppose this PR does not make that problem worse.
Note that ever since #5927, checkpoints only affect block/block header validation AFTER we have downloaded a headers chain that includes the checkpointed hashes. So when a new node is starting up for the first time, if its initial headers sync peer serves a bogus chain, it would still be validated and accepted (even if it includes blocks at heights that have different block hashes checkpointed at those heights), because the main chain with the checkpointed block hashes wouldn’t have been downloaded yet.
But that distinction is not so important, because for the purposes of (1) determining when the initial headers sync peer is bad/broken, or (2) determining whether a new node starting up can be flooded with low difficulty headers, an adversary has other options to cause us the same problems. In the case of issue (1), as @mzumsande explained, for bandwidth reasons we give our initial headers sync peer roughly 20 minutes before we’ll disconnect them if we haven’t yet seen a chain that looks close to caught up. So a peer looking to stall us can just give us no headers at all (rather than give us a bogus headers chain) and can still stall us for the same amount of time, under that check. In the case of issue (2), even if block hashes were being checked at the checkpointed heights prior to downloading a chain with those hashes, an adversary looking to waste our memory could just give us lots of chains that start at genesis and build up to the first checkpoint, and then restart from genesis again with a new chain.
I believe this PR completely solves issue (2).
Regarding issue (1): for a very long time, our protection against a stalling peer has been that we will initiate headers sync (via a getheaders message) with any peer sending us a block INV. This is overkill and made somewhat more bandwidth-wasteful after this PR, so #25720 tries to strike a better balance between robustness in the face of stalling peers, and bandwidth usage. However I believe exactly how we strike that balance is orthogonal to the behavior introduced here, so it makes more sense to discuss strategies for when we start to sync headers with a peer in that PR, rather than here.
in src/net_processing.cpp:2467 (commit 5691c174c5, outdated)
5691c174c5cb2f9947e6cc2db8d32b4b4715e47d: I was a bit surprised this never triggered an unused variable warning, but that’s apparently possible with non-trivial types: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55203
in src/net_processing.cpp:2386 (commit 5691c174c5, outdated)
2318+ arith_uint256 near_chaintip_work = 0;
2319+ LOCK(cs_main);
2320+ if (m_chainman.ActiveChain().Tip() != nullptr) {
2321+ const CBlockIndex *tip = m_chainman.ActiveChain().Tip();
2322+ // Use a 144 block buffer, so that we'll accept headers that fork from
2323+ // near our tip.
5691c174c5cb2f9947e6cc2db8d32b4b4715e47d: I assume we’re interested in those, because although their total work is lower, there’s a chance a miner builds on top of them. In that case, if we receive only that new block (header), we already have the parent? Otherwise we’d need a peer to send multiple headers to learn about this new block.
Yeah, so when I started writing this code I think I had no buffer at all, which turns out to cause problems when there’s a small reorg (requiring extra round-trips for blocks to successfully relay). I think this also caused zillions of our functional tests to fail.
Upon further thought, I reasoned that for anti-DoS purposes, it’s enough to have any kind of work threshold that grows with the tip, rather than is pegged at the tip’s work exactly. So I picked something big enough that we should never have a problem with headers not relaying if they’re at all close to the tip, while small enough that I figured an attacker would not be able to keep up for very long unless they had a huge amount of hashpower. The exact value here is pretty arbitrary though.
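Putting the quoted snippet together with that explanation, the threshold computation is roughly (a simplified sketch of what the PR computes; the function name and shape are assumed):

#include <algorithm>
#include <arith_uint256.h>
#include <chain.h>

arith_uint256 AntiDoSWorkThreshold(const CBlockIndex& tip, const arith_uint256& minimum_chain_work)
{
    // Subtract 144 blocks' worth of work from the tip (clamped so we don't
    // underflow), so that headers forking from near the tip still qualify.
    const arith_uint256 buffer{std::min<arith_uint256>(144 * GetBlockProof(tip), tip.nChainWork)};
    const arith_uint256 near_chaintip_work{tip.nChainWork - buffer};
    // Never accept chains with less than the compiled-in minimum chain work.
    return std::max(near_chaintip_work, minimum_chain_work);
}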
in src/headerssync.cpp:80 (commit 5691c174c5, outdated)
89+ if (m_download_state == State::FINAL) return ret;
90+
91+ if (m_download_state == State::INITIAL_DOWNLOAD) {
92+ // During INITIAL_DOWNLOAD, we minimally validate block headers and
93+ // occasionally add commitments to them, until we reach our work
94+ // threshold.
In this paper the author suggested a solution (section 4.3) using proof-of-work to limit the rate of downloading headers (as well as pruning alternate header chain branches aka “chain width”).
I wonder how this strategy compares to the double-download strategy in this PR. They might be close in total elapsed time (slow downloading vs. downloading twice) but would obviously save on bandwidth.
Conceptually I love the idea of replacing hard-coded checkpoints, I’m just curious how this alternate solution measures up.
Sjors
commented at 8:36 am on August 8, 2022:
member
@pinheadmz of the three attack scenarios in the paper, 1.3 is prevented because we ignore unsolicited blocks with less work than the tip (see p2p_unrequested_blocks.py). The paper also points this out in (2). So the two attacks to worry about are 1.1 and 1.2. If you can solve 1.2 without the use of checkpoints, then 1.1 is automatically solved.
So the comparison boils down to: how does this PR stack up against the proposal in the paper in fixing 1.2.
Specifically the solutions we’re looking at are 4.2 and 4.3 in the paper, since 4.1 (headers-first sync) is already implemented.
Note that bandwidth for headers is only a fraction of bandwidth for blocks. All things equal less bandwidth is better, but it’s not a huge factor on its own when comparing the approaches.
The same goes for elapsed time. Downloading headers takes only a fraction of the time of downloading blocks.
Also, because we can receive headers unsolicited, the amount of time and bandwidth we spend downloading useless headers is somewhat out of our control (rate limiting based on PoW could help though, see my comment on 4.3).
I think there are two metrics we should care about, given an ongoing spam attack and a single honest node whispering the truth:
1. how long it takes until we know enough to go ahead and download the corresponding blocks
2. how much disk space gets wasted with spam headers (and whether we can prune that before it becomes a serious problem)
One meta issue is that afaik there’s no Bitcoin Core pull request of 4.1 and 4.2 from the paper, so that works in favor of this PR which does have working code (from a rfc7282 point of view). But the paper states there is an implementation for Bcoin, so this is not a strong argument against. What do we know of how these implementations work in practice?
With all that out the way, how do the proposals 4.2 and 4.3 from the paper actually compare to this PR?
4.2 suggests picking a maximum number of header chains to store, before pruning the shortest. It suggests storing 10 chains for a total of 480MB.
a) that’s far more memory use than this PR requires (~700 KiB per peer, regardless of the amount of spam fired at us)
b) In terms of bandwidth, under nice circumstances the approach uses less bandwidth, because headers are only downloaded once. But under attack circumstance most of the bandwidth is spent on downloading spam headers, in which case the redownload of correct headers is negligible. That goes for 4.2 as well as this PR.
c) I’m not sure how the proposal holds up in a timewarp attack; I suppose it could perform the sane initial sanity checks on all header chains, e.g. checking the difficulty adjustment is allowed. Dealing with a timewarp attack is important because it would dramatically increase the space needed to store a header chain.
4.3 Rate limiting header requests: I’m not sure how to evaluate that (others on the mailinglist also expressed difficulty with that). I do notice the paper points to advantages for CPU use, but I’m more worried about storage. I’m also not sure if this approach would put an “honest” massive reorg (by alien attacker) at an unfair disadvantage, e.g. if said reorg started with lots of low difficulty blocks. But I guess eventually we’d find it.
Sjors
commented at 10:19 am on August 8, 2022:
member
But the paper states there is an implementation for Bcoin […] What do we know of how these implementations work in practice?
147+ * ProcessingResult.success: set to false if an error is detected and the sync is
148+ * aborted; true otherwise.
149+ * ProcessingResult.locator: if present, the next locator to send in a
150+ * getheaders message
151+ */
152+ ProcessingResult ProcessNextHeaders(const std::vector<CBlockHeader>&
Passing headers in and out of this function is a bit hard to follow. I think we need a more descriptive term than “process” and “to process”. You could rename the input headers to received_headers, and the return value headers_to_process to pow_checked_headers or filtered_headers.
But perhaps it’s even more clear if you decouple the feeding of headers into HeadersSyncState from the step of extracting filtered headers. Though I suppose these do have to happen in lock step.
I’m doing the rename you suggest, but yes pulling headers out of HeadersSyncState needs to happen in lockstep with processing, so I think rather than add another API call to extract them, it makes the most sense to just return them.
in src/net_processing.cpp:5077 (commit 2444ce2f8c, outdated)
5071@@ -5076,6 +5072,24 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
5072
5073 MaybeSendAddr(*pto, *peer, current_time);
5074
5075+ // Delay sending SENDHEADERS (BIP 130) until we're done with an
5076+ // initial-headers-sync with this peer. Receiving headers announcements for
5077+ // new blocks while trying to sync their headers chain is problematic,
5078+ // because of the state tracking done.
5079+ if (!peer->m_sent_sendheaders && pto->GetCommonVersion() >= SENDHEADERS_VERSION) {
2444ce2f8cb1330ba1dc88b7e5407a3efcfd9aeb: The above MaybeSendPing and MaybeSendAddr suggest we should put this in a helper function MaybeSendSendHeaders().
Switch to randomized commitment offset (which the simulation script shows reduces the attacker’s power a bit, permitting lower per-peer memory usage)
Send locators that contain mixes of the best-known global state (m_chainman.m_best_header) and the current per-peer headerssync state. This seems a very robust way to avoid a whole lot of duplicate work/bandwidth. It means that if we’ve received (in 2nd phase) headers from peer A up to height H, while we’re simultaneously syncing headers (phase 1 or 2) from peer B at height below H, the locator we send to B will make them instantly skip ahead to wherever we already are with A (starting a new phase 1 there). If we’ve finished syncing with A, it will cause an abort with B if they only have the same headers.
Add a bunch of logging
Interface changes:
The StartInitialSynchronization function is gone; some of its arguments moved to the constructor, and the initial headers are now passed by the caller via ProcessNextHeaders. This removes some duplicated functionality and simplifies the implementation, as PeerManagerImpl::TryLowWorkHeadersSync can now invoke PeerManagerImpl::IsContinuationOfLowWorkHeadersSync to actually deal with the headers, removing duplication that was introduced by the tip-mixing above.
The MakeNextHeadersRequest function is now part of the public interface, and is expected to be invoked by the caller when ProcessNextHeaders reports request_more (rather than being invoked in-line by ProcessNextHeaders directly). This avoids the need to lock cs_main when nothing is to be requested (to get the global best header tip).
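Under that reworked interface, the caller-side flow would look roughly like this (a sketch using the ProcessingResult fields mentioned earlier; names and details assumed):

void HandleHeadersForSync(HeadersSyncState& headers_sync,
                          const std::vector<CBlockHeader>& received_headers,
                          bool full_headers_message)
{
    const auto result{headers_sync.ProcessNextHeaders(received_headers, full_headers_message)};
    if (!result.success) return; // sync aborted; the state has finalized itself
    for (const CBlockHeader& header : result.pow_validated_headers) {
        // Accept into the block index / hand off to validation.
    }
    if (result.request_more) {
        // Only now is cs_main needed, to mix the global best header tip
        // into the locator.
        const CBlockLocator locator{headers_sync.MakeNextHeadersRequest()};
        // ... send getheaders(locator) to the peer ...
    }
}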
Sjors
commented at 10:13 am on August 10, 2022:
member
@sipa here’s a log from when I tried your rebased branch on mainnet. Just as it reached 90% synced headers with one of the peers, it started with another peer, quickly reached the sufficient work and started redownloading. But then it aborted with “non-continuous headers at height=680041 (redownload phase)”. It does eventually complete the download with this new peer.
There was no new block produced at the time where it started to sync with the new peer. Unfortunately I didn’t log enough other things to see what could have triggered it.
sipa
commented at 3:42 pm on August 10, 2022:
member
@Sjors Cool, that abort/restart of peer=3 at the same height was surprising to me, but I’ve reproduced it.
sipa
commented at 7:19 pm on August 10, 2022:
member
@Sjors Fixed it in both branches. The problem was that when asking for the initial headers from a peer (outside of the HeadersSyncState logic), we ask for one block earlier than our actual tip, so that we always get a meaningful result back even if the peer has the same tip already. However, in my logic to mix the tip into HeadersSyncState getheaders requests, I didn’t implement this one-step-back. The result was that if you transitioned from one peer to another during headers sync, you’d initialize one block backward, but the first headers arriving after that would start one block further ahead, causing an abort and restart.
Sjors
commented at 8:18 am on August 11, 2022:
member
Aside from the bug, what triggers the transition to another peer? It definitely wasn’t a timeout.
sipa
commented at 12:06 pm on August 11, 2022:
member
@Sjors They may have just disconnected for whatever reason.
sdaftuar
commented at 2:38 pm on August 11, 2022:
member
4.2 suggests picking a maximum number of header chains to store, before pruning the shortest. It suggests storing 10 chains for a total of 480MB.
@pinheadmz@sjors Absent other details, I think this approach would fail at guaranteeing that we eventually reach consensus with our peers. Imagine that some adversary has given us 10 bogus chains, and we connect to an honest peer for the first time. What do we do during headers sync with that peer, in the window of time that the headers we’re downloading are far behind (in work) from the work on the bogus chains we have stored already? If we prune as we go, then we will make no progress towards downloading the honest chain; but if we don’t, then we’re open to an attack from an adversary that would serve us a huge, low-work chain (eg due to timewarp).
sdaftuar force-pushed
on Aug 11, 2022
laanwj added this to the milestone 24.0
on Aug 11, 2022
sdaftuar force-pushed
on Aug 11, 2022
sdaftuar
commented at 7:39 pm on August 11, 2022:
member
Update 2022-08-06: why do we need bitdeque instead of just std::vector<bool>? Is it only because we call m_header_commitments.pop_front()? Couldn’t we also just keep the commitments, track our index with an integer and then delete the whole thing when we’re done?
Two reasons, I think – one is that we use more memory that way, the other is that the bookkeeping seemed more complicated to think about than if we pop entries as we go.
sipa
commented at 8:09 pm on August 11, 2022:
member
Two reasons, I think – one is that we use more memory that way, the other is that the bookkeeping seemed more complicated to think about than if we pop entries as we go.
Another is that with an anticipated (but not currently implemented) optimization, the m_chain_start point may move forward while commitments are being added, necessitating popping pieces off the front of m_commitment_bits.
sdaftuar force-pushed
on Aug 11, 2022
sdaftuar
commented at 8:44 pm on August 11, 2022:
member
Updated this PR to include the many improvements @sipa drafted in his branch (thank you!). Also included a fuzz test by @mzumsande, and addressed many of the review comments from @sjors.
instagibbs
commented at 8:58 pm on August 11, 2022:
member
Would be nice to give some log feedback that the anti-DoS header sync has at least started. Currently for testnet it’s ~3 minutes of nothing of use being printed.
related question: Is the sync state exposed anywhere via RPC? getchaintips showed nothing happening either.
sipa
commented at 9:04 pm on August 11, 2022:
member
@instagibbs Yeah, there was some chatter about that, but it’s genuinely hard to do: there isn’t any global state that is even aware of the progress made until the second phase is hit (as the first phase takes place entirely in per-peer structures).
My best idea (but I haven’t gotten past an idea):
Make net_processing keep track of the per-peer 1st phase progress, and e.g. aggregate it into a maximum.
Make net_processing report that to validation (because validation is currently responsible for emitting callbacks about sync progress).
Change the callback interface for headers to not use CBlockIndex* anymore, but height/time/progress base.
Add ways to notify 1st/2nd phase along with this.
Make all listeners of these notifications aware of it.
Just logging is a lot easier of course, but it’s unclear how much should be unconditionally logged. I’m not sure a per-headerblob logging is even appropriate.
ajtowns
commented at 3:29 am on August 12, 2022:
contributor
related question: Is the sync state exposed anywhere via RPC? getchaintips showed nothing happening either.
Seems like it might be good to at least report the per-peer value of m_current_height in getpeerinfo? Perhaps reporting the nTime or tracking the median time of the best initial sync block would be nice, then you could report the best one via the gui to give people some idea of progress?
Also, I’m not sure it makes sense to have the initial header sync phase limited to a single peer?
After adding the above to getpeerinfo, and watching a header sync of testnet from scratch, it first did an initial sync up to about block 1M with peer 0, then peer 0 disconnected, and it immediately started an initial header sync with all peers, with peer=8 eventually winning the race and being the one to send all the actual headers.
It looks to me like a new block being found means I’ll receive an INV from all nodes, which will then trigger me sending a GETHEADERS to all nodes, which then gets me doing an initial sync with all nodes anyway.
I’m seeing some nodes get through initial sync at about twice the speed of others, so seems like it might be a worthwhile speedup even if your first node doesn’t decide to disconnect.
in
src/headerssync.cpp:191
in
6862e74c70outdated
186+ // work chain if they compress the work into as few blocks as possible,
187+ // so don't let anyone give a chain that would violate the difficulty
188+ // adjustment maximum.
189+ if (!PermittedDifficultyTransition(m_consensus_params, next_height,
190+ previous.nBits, current.nBits)) {
191+ LogPrint(BCLog::HEADERSSYNC, "Initial headers sync aborted with peer=%d: invalid difficulty transition at height=%i (commitment phase)\n", m_id, next_height);
I think just for code-review/testing’s sake – NET gets a lot of traffic.
sdaftuar force-pushed
on Aug 12, 2022
sdaftuar
commented at 2:16 pm on August 12, 2022:
member
related question: Is the sync state exposed anywhere via RPC? getchaintips showed nothing happening either.
Seems like it might be good to at least report the per-peer value of m_current_height in getpeerinfo? Perhaps reporting the nTime or tracking the median time of the best initial sync block would be nice, then you could report the best one via the gui to give people some idea of progress?
Just added a commit that exposes m_current_height in getpeerinfo(), and also updated the branch to include tests for the handling of low-work compact blocks and unrequested blocks.
Also, I’m not sure it makes sense to have the initial header sync phase limited to a single peer?
Did you see Pieter’s comments on the bandwidth issue here? Also, I think it makes most sense to discuss this topic in #25720, as the bandwidth vs. robustness issues around initial headers sync are largely independent of this change (though this PR does increase the bandwidth we’d expect to use).
At this point I think I’ve responded to or addressed all the comments so far (thank you reviewers!), with the exception of the logging issue that has been raised. I will continue to think about that, but given how invasive it may be to address this properly, I wonder what people think of deferring that improvement to a followup PR?
sipa
commented at 11:24 pm on August 13, 2022:
member
There is a lock ordering bug in the current branch, which can lead to a deadlock when opening the peer view window in the GUI. There was a crossing of m_header_sync_mutex and cs_main: GetNodeStateStats is sometimes called while holding cs_main, and it needs m_header_sync_mutex, which means that elsewhere we cannot grab cs_main while already holding m_header_sync_mutex. There is a commit that fixes this by pushing down the m_header_sync_mutex locking; an alternative was grabbing cs_main for all of the headers sync processing, but the more fine-grained lock approach is cleaner overall, I think.
My branch also adds progress reporting for the first phase:
Aggregate statistics about (the first phase of) low-work headers sync across all peers in net_processing
Report those statistics to validation
Let validation rate-limit those reports, and then use them for logging and emitting signals with progress updates
Let the GUI use those for reporting as well.
In the logs/UI and code related to it I’ve picked the term “headers pre-synchronization” to refer to the first phase of low-work headers sync (as that’s a bit of a mouthful). Bikeshedding welcome - but whatever we pick it may be useful to be consistent in naming in other places too.
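(For reviewers trying to picture the plumbing, a minimal sketch of the per-peer aggregation described above; the types and the aggregation helper are simplified assumptions, not the branch’s exact code:)

#include <cstdint>
#include <map>
#include <optional>
#include <utility>

using NodeId = int64_t;
// Best presync progress a peer has shown so far: (height, header nTime).
using HeadersPresyncStats = std::pair<int64_t, uint32_t>;

std::map<NodeId, HeadersPresyncStats> m_headers_presync_stats;

// Aggregate to a single best value that net_processing can hand to validation.
std::optional<HeadersPresyncStats> BestPresyncProgress()
{
    std::optional<HeadersPresyncStats> best;
    for (const auto& [id, stats] : m_headers_presync_stats) {
        if (!best || stats > *best) best = stats;
    }
    return best;
}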
sipa
commented at 5:02 pm on August 14, 2022:
member
So a few open questions regarding the logging/progress:
Is it necessary to address logging/progress before 24.0? It’s a somewhat bad UX if we don’t, as there is no feedback indicating anything is happening at all, for possibly several minutes. Addressing it properly does seem to require some amount of complexity, though.
If Qt GUI changes for reporting 1st phase headers sync are considered needed, there is the concern that it adds a few more translation strings. Not having any reporting at all is however worse than untranslated ones, I would guess?
Should the GUI changes be done through a separate PR to the GUI repository?
My branch above doesn’t add a GUI element in the peer view details for first-phase headers height yet; I’d consider that much lower priority than just having progress bar updates for it, and probably better done by someone familiar with Qt.
hebasto
commented at 6:08 pm on August 14, 2022:
member
My branch above doesn’t add a GUI element in the peer view details for first-phase headers height yet; I’d consider that much lower priority than just having progress bar updates for it, and probably better done by someone familiar with Qt.
Agree.
Should the GUI changes be done through a separate PR to the GUI repository?
I assume we have no such strict policy. If reviewers are ok to review and test Qt code here, why not? But separating low-priority (in comparison to the new headers sync logic) GUI changes into a dedicated PR in the GUI repo sounds better.
If Qt GUI changes for reporting 1st phase headers sync are considered needed, there is the concern that it adds a few more translation strings. Not having any reporting at all is however worse than untranslated ones, I would guess?
It depends on how many “a few”. I guess an additional week will be enough for translators to update their translations with a couple of new strings.
sipa
commented at 6:52 pm on August 14, 2022:
member
It depends on how many “a few”. I guess an additional week will be enough for translators to update their translations with a couple of new strings.
Two, both variations of “Pre-syncing headers” (assuming that’s the terminology we’ll settle on).
in
src/net_processing.cpp:2515
in
6862e74c70outdated
I think it might make sense to add an if (!result.success) return true; when you end up in FINAL state – otherwise (at least for me) it looks like ordinary header processing then immediately triggers a restart.
As mentioned here, the intent is indeed for ordinary headers processing – that is, a headers message which connects to a different point in the locator we send than continuing from the last header they sent us – to trigger a restart of this logic (hopefully at a later point in the sync, but potentially anywhere, e.g. if the peer underwent a reorg during sync).
I assume this is also moot, now that mixing in locators from m_best_header has been removed.
in
src/net_processing.cpp:2663
in
6862e74c70outdated
2659+ // (which is done separately inside m_headers_sync)
2660+ bool already_validated_work = false;
2661+
2662+ // If we're in the middle of headers sync, let it do its magic.
2663+ if (IsContinuationOfLowWorkHeadersSync(peer, pfrom, headers, previously_downloaded_headers)) {
2664+ return;
If the suggestion above is adopted, then header sync failure will result in exiting here without a request for more headers being sent out to this peer. In that case, I think you’d also want to add code to toggle fSyncStarted and decrement nSyncStarted to allow another peer to pick up where this one left off. Needs some care to avoid cs_main lock ordering problems of course.
in
src/headerssync.cpp:105
in
6862e74c70outdated
97+ // receive, and add headers to our redownload buffer. When the buffer
98+ // gets big enough (meaning that we've checked enough commitments),
99+ // we'll return a batch of headers to the caller for processing.
100+ ret.success = true;
101+ for (const auto& hdr : received_headers) {
102+ if (!ValidateAndStoreRedownloadedHeader(hdr)) {
We expect to skip headers here due to including locators based on our tip, but ValidateAndStoreRedownloadedHeader can’t cope with skipped headers. This doesn’t seem sensible to me.
Again see #25717 (comment); the intent is that by using a locator which mixes in entries from m_best_header, the peer can skip ahead in the sync (in a scenario where our best header is updating from other peers) to reduce overall bandwidth. In such an event, this code would return an error and we’d drop into starting a new sync with the peer.
This should be moot now that mixing in locator entries from m_best_header has been removed.
ajtowns
commented at 11:14 am on August 15, 2022:
contributor
With this commit (ac4cd62b3abfb4b28135e4f126fc672ca606b643) and #25720, I’m seeing failures after one peer completes its initial download, and the tip progresses past either where other nodes are up to in their initial download, or (less commonly) in their redownload.
In the former case, I think the only real bug is that they then immediately restart their initial download which makes for noisy logs and wastes bandwidth (since they’re getting headers more slowly than the peer that’s redownloading).
In the latter case, that wastes the entire initial download, which seems more wasteful, and seems like it might introduce some new (mild?) DoS behaviours.
Not necessarily blocking, but it feels buggy/ugly to me as it stands.
sipa
commented at 2:59 pm on August 15, 2022:
member
@ajtowns I believe you’re seeing the effects of the mixing-in-m_best_header-based-skipping, but may be misunderstanding what is happening.
First of all, ValidateAndStoreRedownloadedHeader indeed doesn’t deal with the peer deciding to continue from one of the mixed-in locator entries, but it doesn’t need to; we will, in the same processing of the received headers message, abort the headers sync, and start a new one, with the first header in that received message as starting point. Assuming it is an honest peer, doing so is always advantageous as it means it’s skipping some headers sync work with that peer (the locator entries are sorted in such a way that this is the case). Note that while it’s possible during INITIAL_DOWNLOAD phase that the actual height of synchronization goes backwards, this may still be advantageous as it means less work to redo during REDOWNLOAD (because while the tip moves backwards, the starting point moves forward).
The intention is that once we start synchronizing headers with a peer, we don’t ever stop doing so; it is up to them to stop giving us headers (by reaching the sync end). We mix in headers from the m_best_header tip to give the peer a chance to skip ahead if there is some part both sides already have, but it is up to them. If there is one peer faster than all others, moving our m_best_header tip forward with every received headers blob, then all other peers will continuously be skipping ahead to keep up, but always stay behind this faster peer, until the faster peer reaches the end of the synchronization, at which point all other peers will skip ahead to the end as well, and stop. So in short: while headers sync is ongoing, we will effectively be continuously asking for headers from all headers-syncing peers, until it’s done. This is already the case in the current codebase actually, though this PR makes it worse by (a) taking longer overall before synchronization completes even with a single peer and (b) only making progress in synchronization visible to other, slower peers during the second phase. Some of this downside is mitigated by this mixing in, causing headers sync with other (honest) peers to stop as soon as sync with one peer completes, rather than needing to wait for them to get to the end on their own.
It’s not impossible to detect certain conditions, like a peer giving us things we’re clearly not interested in anymore, to just give up on the synchronization with them, but there are some scary edge cases to think about then. For example, imagine a peer who has a deep reorg of the honest chain (however unlikely that may be, we should be able to deal with it), which forks off more than 2000 headers past the highest previous locator entry in the m_best_header based locators. A newly started header sync at that point will just result in 2000 headers we already have, and aborting the sync entirely and starting over later from m_best_header again will not make any progress towards ever actually getting the peer’s chain. For that reason, it seems safer to just keep fetching, and letting the “skipping” be done by the remote party - they know best what they’re trying to give us.
There are two other improvements possible to consider that can reduce bandwidth waste:
Being more careful about which peers we start synchronizing with, in a controlled way that still results in us eventually trying all peers, so we’ll certainly eventually get whatever headers any and all of them have to give us. #25720 is a step in that direction.
Moving the start point forward: if we receive headers from a peer that we already have, in some cases it is possible to just move the HeadersSyncState m_chain_start point forward (and correspondingly pop off commitments from the buffers in it), so that we restart redownload further ahead when we get there. I have a half-finished commit that does this, but it needs thinking about correctness (again related to this deep reorg scenario above), and adds a bunch of complexity we may not want in this PR.
sdaftuar force-pushed
on Aug 15, 2022
sdaftuar
commented at 4:17 pm on August 15, 2022:
member
Thanks @sipa, I have pushed your latest branch with the logging code here (and I squashed the lock-inversion fix into the commit where I had introduced the bug).
in
src/headerssync.h:238
in
57d8791463outdated
225+
226+ /** Store the latest header received while in INITIAL_DOWNLOAD */
227+ CBlockHeader m_last_header_received;
228+
229+ /** Height of m_last_header_received */
230+ int64_t m_current_height{0}; // height of last header received
mzumsande
commented at 5:12 pm on August 15, 2022:
in
src/net_processing.cpp:2758
in
57d8791463outdated
2641+
2642+ // Before we do any processing, make sure these pass basic sanity checks.
2643+ // We'll rely on headers having valid proof-of-work further down, as an
2644+ // anti-DoS criteria.
2645+ if (!CheckHeadersPoW(headers, m_chainparams.GetConsensus(), peer)) {
2646 return;
mzumsande
commented at 5:34 pm on August 15, 2022:
Should we call Misbehaving() if the PoW is invalid?
ajtowns
commented at 8:52 pm on August 15, 2022:
contributor
First of all, ValidateAndStoreRedownloadedHeader indeed doesn’t deal with the peer deciding to continue from one of the mixed-in locator entries, but it doesn’t need to; we will, in the same processing of the received headers message, abort the headers sync, and start a new one, with the first header in that received message as starting point. Assuming it is an honest peer, doing so is always advantageous […]
I don’t think that’s correct. Restarting headers sync pushes you back into initial download, so the behaviour on testnet for a peer that’s finished initial sync but skips some blocks during redownload looks like:
download 2M headers for initial sync
download 500k headers for redownload phase
find that the tip is up to height 550k
download headers from 550k to 552k that don’t connect at 500k, and restart
find that in the meantime the tip is now up to height 560k, and your faster peer has died
download 1.44M headers (for blocks at height 560k to 2M) for initial sync
redownload the 1.44M headers to actually update the tip
That is, you’ve saved downloading 50k headers, at a cost of downloading 1.44M headers in the redundant initial sync. That’s not advantageous…
Being able to drop another peer from redownload phase back to initial sync seems like it has scary dos opportunities to me, and just searching through the history of the tip that we sent to them seems a very easy way of mitigating the problem in the honest-peer case.
It’s definitely less important if you’re only in the initial sync phase, while some other peer is further along in the redownload phase; you are only getting pushed forward there.
But at that point you’ve already been “lapped”, and the other peer is demonstrably currently supplying valid headers faster than you. In the normal case there, you’re not going to win; and even if you’re assuming the other node is an attacker, their attack is just “delivering valid headers quickly”… Continuing to try initial headers normally in those circumstances just seems like a waste of bandwidth – so waiting until the good nodes disconnect or a new block comes in seems like it would be better logic for the non-normal cases.
I guess at the very least, if that path is normal and expected, then constantly logging it as “aborted non-connecting // headers sync started” doesn’t seem helpful.
It’s not impossible to detect certain conditions, like a peer giving us things we’re clearly not interested in anymore, to just give up on the synchronization with them, but there are some scary edge cases to think about then. For example, imagine a peer who has a deep reorg of the honest chain (however unlikely that may be, we should be able to deal with it), which forks off more than 2000 headers past the highest previous locator entry in the m_best_header based locators. A newly started header sync at that point will just result in 2000 headers we already have,
But we don’t consider repeated headers a problem, just ones that don’t connect? And if we’ve already accepted them, they’ll connect fine?
sipa
commented at 8:58 pm on August 15, 2022:
member
Restarting headers sync pushes you back into initial download, so the behaviour on testnet for a peer that’s finished initial sync but skips some blocks during redownload looks like:
That’s not what the code is supposed to do. The double-height computation should in the scenario you describe treat continuing the current sync as better than restarting, order the locator entries to account for that, and assuming the peer picks the first locator entry of the chain they want to send, just continue syncing.
Specifically, once 2M headers have downloaded in INITIAL_DOWNLOAD, and 500k headers have downloaded in REDOWNLOAD, the double-height score for continuing should be 2500000. The double-height score for restarting from 550k should be 1100000. Continuing has a higher score, so should be placed first in the locator.
Is that not what you’re seeing? The code may be buggy of course, or is it possible you were using an older version (this branch briefly had mixing-in logic without the double-height weighing last week, which would behave as you describe here).
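(To make the double-height weighing concrete, a rough sketch of the scoring just described; the helper names here are mine, not the PR’s:)

// Score each candidate continuation point by the total download it preserves.
// Continuing counts both completed phases; restarting at height h counts h
// twice, since headers below h must be both presynced and redownloaded again.
int64_t ScoreContinue(int64_t presync_height, int64_t redownload_height)
{
    return presync_height + redownload_height; // 2000000 + 500000 = 2500000
}
int64_t ScoreRestart(int64_t restart_height)
{
    return 2 * restart_height; // 2 * 550000 = 1100000
}
// Locator entries are ordered by descending score, so an honest peer that
// continues from the first entry it recognizes keeps the ongoing sync going.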
In 1178015251732b2db87c0b84dd7ebe3f435a6771 “Utilize anti-DoS headers download strategy”
ValidateAndStoreHeadersCommitments, ValidateAndProcessSingleHeader, ValidateAndStoreRedownloadedHeader, and PopHeadersReadyForAcceptance all have this same check. However, they really only have a single valid state they should be operating in, so I think it would be better to check whether the current state is what is expected (INITIAL_DOWNLOAD for the first two, and REDOWNLOAD for the latter two) rather than that it is not FINAL.
mzumsande
commented at 9:29 pm on August 15, 2022:
We already check the proof of work as part of the basic sanity checks in net_processing (ProcessHeadersMessage), so I don’t think the check could ever fail at this later point.
I was nervous about omitting such an important check in the module that is designed to handle this, but you’re right of course. I added a comment indicating the caller is required to do this, and removed this call.
ajtowns
commented at 9:31 pm on August 15, 2022:
contributor
Is that not what you’re seeing?
Here’s some logs (for commit fc2e19e61261cedade18d1a5b9de2075523fa6aa merged onto 22d96d76ab02fc73e7fe0d810bacee4c982df085). I reset chainwork so that 1,156,000 testnet headers is enough. Peers 0 and 1 managed to complete initial sync. Peer 0’s log entries look like:
2022-08-15T21:14:38Z [headerssync] Initial headers sync started with peer=0: height=0, max_commitments=3738869, min_work=00000000000000000000000000000000000000000000002830dab7f76dbb7d63
2022-08-15T21:17:30Z [headerssync] Initial headers sync transition with peer=0: reached sufficient work at height=1156000, redownloading from height=0
2022-08-15T21:19:57Z [headerssync] Initial headers sync complete with peer=0: releasing all at height=1156000 (redownload phase)
Peer 1’s log entries look like this:
2022-08-15T21:15:12Z [headerssync] Initial headers sync started with peer=1: height=0, max_commitments=3738870, min_work=00000000000000000000000000000000000000000000002830dab7f76dbb7d63
2022-08-15T21:18:26Z [headerssync] Initial headers sync transition with peer=1: reached sufficient work at height=1156000, redownloading from height=0
2022-08-15T21:18:57Z [headerssync] Initial headers sync aborted with peer=1: non-continuous headers at height=176001 (redownload phase)
2022-08-15T21:18:57Z [headerssync] Initial headers sync started with peer=1: height=666040, max_commitments=2124319, min_work=00000000000000000000000000000000000000000000002830dab7f76dbb7d63
2022-08-15T21:18:57Z [headerssync] Initial headers sync aborted with peer=1: non-continuous headers at height=668040 (commitment phase)
along with another ~70 started/aborted pairs, as peer 1’s initial sync keeps getting reset to match peer 0’s successful redownload.
Peers 3 and 5 contribute another 379 aborted/started pairs; they never made it out of commitment phase though, so aren’t particularly interesting.
I don’t think the double-height counts matter here at all [EDIT: I think the double height logic is delaying the restart as designed; but there’s still no reason to restart; I’ve got the data the peer skipped, and can easily validate it against the peer’s commitment, so there’s no need to restart] – peer 1 is just failing due to the if (header.hashPrevBlock != m_redownload_buffer_last_hash) check in ValidateAndStoreRedownloadedHeader.
in
src/headerssync.cpp:158
in
1178015251outdated
153+ // logic.
154+ LogPrint(BCLog::HEADERSSYNC, "Initial headers sync aborted with peer=%d: non-continuous headers at height=%i (commitment phase)\n", m_id, m_current_height);
155+ return false;
156+ }
157+
158+ // If it does connect, (minimally) validate and store a commitment to each one.
mzumsande
commented at 9:32 pm on August 15, 2022:
nit: it only stores a commitment to some headers (depending on HEADER_COMMITMENT_PERIOD and m_commit_offset), not to each one.
sipa
commented at 9:41 pm on August 15, 2022:
member
2022-08-15T21:18:57Z [headerssync] Initial headers sync aborted with peer=1: non-continuous headers at height=176001 (redownload phase)
2022-08-15T21:18:57Z [headerssync] Initial headers sync started with peer=1: height=666040, max_commitments=2124319, min_work=00000000000000000000000000000000000000000000002830dab7f76dbb7d63
So, this is good. Peer 1 used to be in a state where it would still need to redownload another 1156000-176001=979999 blocks. By restarting at height 666040, it only has to download 1156000-666040=489960 headers (twice, once in commitment phase, once in redownload phase), or 979920 headers in total. It’s only a tiny improvement, but it’s a strict reduction in the amount of downloading left.
peer 1 is just failing due to the if (header.hashPrevBlock != m_redownload_buffer_last_hash) check in ValidateAndStoreRedownloadedHeader.
I think you’re confusing cause and effect. That check is failing because the peer sent us non-continuous headers. And the peer is doing that because we want it to: restarting is less work than continuing.
sdaftuar force-pushed
on Aug 16, 2022
sdaftuar force-pushed
on Aug 16, 2022
sdaftuar force-pushed
on Aug 16, 2022
sdaftuar force-pushed
on Aug 16, 2022
sdaftuar
commented at 4:10 pm on August 16, 2022:
member
After seeing the discussion between @ajtowns and @sipa regarding bandwidth optimization involving what locators we send and how we process “skipping ahead” in our headers download (in the situation where one of our peers is serving us headers at a faster rate than another), I think it makes sense to drop any such optimization from this PR. Here’s my thinking:
There are a number of ways (such as was done in #25720) to reduce our bandwidth usage during initial headers sync. While this PR increases bandwidth used by a node starting up for the first time, the overall bandwidth usage is still far below what a new node will end up using during block download, particularly now that #25720 has been merged (mitigating bad behavior that existed prior to this PR). Further improvements to bandwidth usage are certainly possible, but the set of strategies we might consider for doing so goes beyond the code touched here.
That is, while we might consider using tricks with this logic to reduce bandwidth usage – whether by mixing in hashes from m_best_header into the locators we send, and/or using headers that are either received from a peer or accepted into our block index to move a given peer’s sync forward – there are also strategies that may be even more effective that don’t involve code changed by this PR. For example, a refinement to #25720 that further reduces the likelihood of performing initial sync with more than one peer simultaneously (such as by introducing a minimum amount of time that we allow our initial headers-sync peer to complete sync, before we add any other peers, even via the block INV mechanism) would materially affect the impact of optimizations designed to improve the behavior when we have multiple sync peers.
Alternatively, deploying a p2p protocol change to compress headers message (as has been discussed in the past, such as by omitting prevHash to save 32 bytes, and compressing other fields such as nVersion and nTime) could lead to bandwidth savings of close to 50%, which would eliminate most of the overhead introduced by this PR, at least for upgraded software that supports such a proposed new protocol.
So given that we have alternatives like these to consider when thinking about bandwidth usage, and particularly given how complex this PR is already, I think deferring bandwidth minimization strategies to future work makes the most sense. For now I’ve removed the optimization that had been in this PR to mix-in locators from m_best_header, which I hope makes the logic easier to reason about.
I’ve also addressed the latest review feedback from @achow101 and @mzumsande, and incorporated the fixups from @sipa to use the PRESYNC terminology in place of INITIAL_DOWNLOAD.
in
src/net_processing.cpp:2447
in
c600a94390outdated
2442@@ -2421,6 +2443,47 @@ bool PeerManagerImpl::IsContinuationOfLowWorkHeadersSync(Peer& peer, CNode& pfro
24432444 if (peer.m_headers_sync->GetState() == HeadersSyncState::State::FINAL) {
2445 peer.m_headers_sync.reset(nullptr);
2446+
2447+ // Update statistics: delete this peer's entry in allstats, and if this was the
mzumsande
commented at 6:27 pm on August 16, 2022:
Comment seems out of date: allstats should be m_headers_presync_stats, and recomputing is not necessary here, because it’s done in the else branch.
sdaftuar
commented at 10:48 pm on August 16, 2022:
Updated comment.
in
src/net_processing.cpp:854
in
85e4b9e8edoutdated
850@@ -851,7 +851,7 @@ class PeerManagerImpl final : public PeerManager
851 EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex, peer.m_getdata_requests_mutex) LOCKS_EXCLUDED(::cs_main);
852853 /** Process a new block. Perform any post-processing housekeeping */
854- void ProcessBlock(CNode& node, const std::shared_ptr<const CBlock>& block, bool force_processing);
855+ void ProcessBlock(CNode& node, const std::shared_ptr<const CBlock>& block, bool force_processing, const bool min_pow_checked);
In commit “Require callers of AcceptBlockHeader() to perform anti-dos checks”
Nit: having const for a by-value function parameter in a declaration is meaningless (it’s not part of the function signature, so even if you have const on the implementation code, it’s not required to be present on the declaration).
sdaftuar
commented at 10:48 pm on August 16, 2022:
Removed.
in
src/headerssync.cpp:151
in
e778d4f995outdated
146+ // Somehow our peer gave us a header that doesn't connect.
147+ // This might be benign -- perhaps we issued an extra getheaders
148+ // message, such as after a block INV was received.
149+ // Or it could be that our peer is broken or malicious. If broken,
150+ // sending a new getheaders immediately could trigger an infinite
151+ // loop. Just give up for now; if our peer ever gives us a block
In commit “Utilize anti-DoS headers download strategy”
Nit: I think this comment is outdated; we don’t actually give up in this scenario, but return false to the net_processing calling code, which then continues different interpretations with it (possible starting a new sync with it, even).
sdaftuar
commented at 10:48 pm on August 16, 2022:
Fixed
in
src/net_processing.cpp:2742
in
e778d4f995outdated
2625+ // If we were in the middle of headers sync, receiving an empty headers
2626+ // message suggests that the peer suddenly has nothing to give us
2627+ // (perhaps it reorged to our chain). Clear download state for this peer.
2628+ LOCK(peer.m_headers_sync_mutex);
2629+ if (peer.m_headers_sync) {
2630+ peer.m_headers_sync.reset(nullptr);
mzumsande
commented at 7:05 pm on August 16, 2022:
I think we should remove a possible m_headers_presync_stats entry here too.
sdaftuar
commented at 10:48 pm on August 16, 2022:
Nit: this can be std::next(first_chain.begin()) or even just first_chain.begin() + 1.
sdaftuar
commented at 10:48 pm on August 16, 2022:
Fixed.
sipa
commented at 7:45 pm on August 16, 2022:
member
ACK, for the code not written by me here.
I’ve reviewed the code, did several experiments on testnet and mainnet, and observed behavior with -debug=headerssync.
A few nits below, which are non-blocking.
sdaftuar force-pushed
on Aug 16, 2022
ajtowns
commented at 4:49 am on August 17, 2022:
contributor
After seeing the discussion between @ajtowns and @sipa regarding bandwidth optimization involving what locators we send and how we process “skipping ahead” in our headers download (in the situation where one of our peers is serving us headers at a faster rate than another), I think it makes sense to drop any such optimization from this PR. Here’s my thinking:
I think that results in the following observable differences:
more bandwidth usage
if the fastest peer you’re syncing with disconnects while in redownload, it will take some time for the next fastest peer to catch up with your best header and you won’t be making any progress until then
previously, once your best header reached min chainwork, you’d start syncing blocks with all nodes; now you’ll have to wait until you actually catch up with the tip, or until each peer meets min chainwork independently
Those don’t seem too bad to me. :+1:
Alternatively, deploying a p2p protocol change to compress headers message (as has been discussed in the past, such as by omitting prevHash to save 32 bytes, and compressing other fields such as nVersion and nTime)
FWIW, since BIP320 has been adopted by miners (effectively providing 16 extra nonce bits), I think it’s hard to compress nVersion below 2 bytes, and nTime seems to need at least 12 bits; not sure the difference between 3.5 bytes and 8 bytes in the normal case is worth the complexity of dealing with the special cases? Batching headers with a common nBits together would save ~4B per header though, since it only changes every 2016 blocks. (If you did a consensus change to have difficulty change every block, you could still have nBits as a constant value over x,000 blocks describing the minimum allowed difficulty in that period)
could lead to bandwidth savings of close to 50%, which would eliminate most of the overhead introduced by this PR, at least for upgraded software that supports such a proposed new protocol.
Upping the number of headers per message from 2000 to 4000 would halve the number of round-trips necessary (or more? if you increased the message size, 22k headers would fit in a 1MB message at 44B/header, and you’d only need ~35 round trips to sync mainnet instead of ~350…), which I think would cut header sync times down pretty dramatically.
Sjors
commented at 2:04 pm on August 17, 2022:
member
The pre-sync phase looks like this:
I don’t think it’s critical to translate. As long as something is making progress, I suspect most new users will be happy :-)
in
src/headerssync.cpp:314
in
07b068e156outdated
306+ if (m_download_state == State::REDOWNLOAD) {
307+ // During redownload, we will download from the last received header that we stored.
308+ locator.push_back(m_redownload_buffer_last_hash);
309+ }
310+
311+ locator.insert(locator.end(), chain_start_locator.begin(), chain_start_locator.end());
07b068e1568efdb6473f1d1eb4b870eb339c748b: do we need the chain_start_locator at all? We know the peer has the hash in locator, because they sent it to us.
We include it to better handle the case where our peer may reorg during the sync. (This way, they won’t go all the way back to genesis just because we omitted locator entries.)
Sjors
commented at 4:10 pm on August 17, 2022:
member
The LocatorEntries helper function in 5ea6f9a34bf3e8751fbca8f6107e99a9f553e31f was introduced in this push: #25717#event-7174395344. It looks correct to me, but what was the motivation for adding it? I guess we were using a CChain object in an earlier version somewhere?
sipa
commented at 5:27 pm on August 17, 2022:
member
@Sjors An earlier version of the code had net_processing construct a CBlockLocator normally, hand that to HeadersSyncState at creation time which would remember it, and then the locator returned was constructed by concatenating the presync/redownload tip hash with the entries in the locator that was passed down from net_processing at construction.
I later introduced the commit you’re referring to (LocatorEntries etc) for the purpose of being able to do tip header mixing, as it needed a way to compute the locator entries, plus it needed more information about those entries (their height).
Tip header mixing was reverted, and some of the complexity in that commit with it. Arguably we don’t strictly need LocatorEntries() anymore; it could just use GetLocator and grab the innards of the returned locator. It doesn’t hurt though: it’s a bit cleaner I think, and a simpler change later if we want to introduce the header tip mixing again.
Sjors
commented at 7:43 pm on August 17, 2022:
member
it’s a bit cleaner I think
Agree in the context it’s used here.
It doesn’t hurt though
Locators are used in various places. I’m not sure how to confirm that the performance decrease from O(1) to O(log n) is negligible in all the places it’s used.
sipa
commented at 9:54 pm on August 17, 2022:
member
I’m not sure how to confirm that the performance decrease from O(1) to O(log n) is negligible in all the places it’s used.
As far as I know, they’re only constructed (a) in order to be sent to the network for requesting headers, or (b) when storing in the wallet or indexes when flushing what the current sync position is. Chasing a few dozen pointers is a sub-microsecond operation. In both scenarios, that’s dwarfed by just network processing times or the disk I/O that also needs to happen.
in
src/headerssync.h:96
in
07b068e156outdated
91+ * little memory even for a very long chain.
92+ *
93+ * - In phase 2 (redownload), keep a lookahead buffer and only accept headers
94+ * from that buffer into the block index (permanent memory usage) once they
95+ * have some target number of verified commitments on top of them. With this
96+ * parametrization, we can achieve a given security target for potential
Yeah, I just like the separation between member functions and data variables, but if this is bothersome I can remove it.
in
src/test/headers_sync_chainwork_tests.cpp:143
in
f4d912a8e4outdated
135+ result = hss->ProcessNextHeaders(headers_batch, false);
136+ BOOST_CHECK(hss->GetState() == HeadersSyncState::State::FINAL);
137+ BOOST_CHECK(result.pow_validated_headers.empty());
138+ // Nevertheless, no validation errors should have been detected with the
139+ // chain:
140+ BOOST_CHECK(result.success);
in
src/test/headers_sync_chainwork_tests.cpp:102
in
f4d912a8e4outdated
97+ // Pretend the first header is still "full", so we don't abort.
98+ auto result = hss->ProcessNextHeaders(headers_batch, true);
99+
100+ // This chain should look valid, and we should have met the proof-of-work
101+ // requirement.
102+ BOOST_CHECK(result.success);
36+ // today. This serves as a memory bound on how many commitments we might
37+ // store from this peer, and we can safely give up syncing if the peer
38+ // exceeds this bound, because it's not possible for a consensus-valid
39+ // chain to be longer than this (at the current time -- in the future we
40+ // could try again, if necessary, to sync a longer chain).
41+ m_max_commitments = 6*(GetAdjustedTime() - chain_start->GetMedianTimePast() + MAX_FUTURE_BLOCK_TIME) / HEADER_COMMITMENT_PERIOD;
In 956109788e, is there a reason to put these assignments in the function body instead of member initializer list? Could make m_minimum_required_work and m_max_commitments const members?
Done, except I left m_max_commitments, just because the calculation and comments are so long that it seems unwieldy to put in the initializer list.
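(As a sanity check on the memory bound, plugging ballpark mainnet numbers into that expression; the time span here is my rough assumption, not a figure from the PR:)

// Roughly 13.5 years between the chain start's median-time-past and now:
const int64_t time_span{430'000'000}; // seconds (assumption)
// Consensus permits at most ~6 blocks per second of MTP advance, so:
const int64_t max_headers{6 * time_span};         // ~2.6 billion headers
const int64_t max_commitments{max_headers / 584}; // HEADER_COMMITMENT_PERIOD
// ~4.4 million commitments; at one bit each, roughly 550 KiB per peer worst case.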
in
src/chain.h:487
in
5ea6f9a34boutdated
476@@ -477,4 +477,10 @@ class CChain
477 CBlockIndex* FindEarliestAtLeast(int64_t nTime, int height) const;
478 };
479480+/** Get a locator for a block index entry. */
481+CBlockLocator GetLocator(const CBlockIndex* index);
mzumsande
commented at 3:14 pm on August 22, 2022:
Having both GetLocator() and a CChain::GetLocator() with identical names, but slightly different behavior (if a nullptr is passed, the latter returns a locator to the tip, the former returns an empty locator) seems like a possible source for errors. Maybe rename CChain::GetLocator() to CChain::GetLocatorToIndexOrTip() (could be done as a follow-up)?
I think it would still be valuable for long-term codebase readability to rename the class method GetLocator() to something like GetLocatorAtTip(), to dissociate it well from the new method in the same file (note, we have yet another GetLocator() in src/index/base.cpp). There could be a concern of a validation/chain method misusage inducing some safety bug, especially with the ongoing libbitcoinkernel and assumeutxo works.
Beyond that, it could be valuable to document CBlockLocator in src/primitives/block.h. This data struct is used by GETHEADERS and GETBLOCKS, but iirc its usage, the way we construct it with steps back to genesis, and the sanity bounds we enforce on our side (like MAX_LOCATOR_SIZE) are only loosely documented across the codebase. That would make it hard to reuse for future p2p extensions or for experimentation with other chain-syncing strategies.
All suggestions.
in
src/net_processing.cpp:4259
in
cdc1dd66a3outdated
4119@@ -4111,7 +4120,7 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
4120 // we have a chain with at least nMinimumChainWork), and we ignore
4121 // compact blocks with less work than our tip, it is safe to treat
4122 // reconstructed compact blocks as having been requested.
4123- ProcessBlock(pfrom, pblock, /*force_processing=*/true);
4124+ ProcessBlock(pfrom, pblock, /*force_processing=*/true, true);
mzumsande
commented at 3:54 pm on August 22, 2022:
nit (here and elsewhere): would be nice to have more named arguments, at least in all the places where they are already used for other args.
990@@ -987,7 +991,7 @@ class ChainstateManager
991 * @param[out] new_block A boolean which is set to indicate if the block was first received via this call
992 * @returns If the block was processed, independently of block validity
993 */
994- bool ProcessNewBlock(const std::shared_ptr<const CBlock>& block, bool force_processing, bool* new_block) LOCKS_EXCLUDED(cs_main);
995+ bool ProcessNewBlock(const std::shared_ptr<const CBlock>& block, bool force_processing, bool* new_block, const bool min_pow_checked) LOCKS_EXCLUDED(cs_main);
mzumsande
commented at 4:11 pm on August 22, 2022:
doc above needs an update, plus the ordering is no longer (in, in-out, out) per developer notes.
1002@@ -999,7 +1003,7 @@ class ChainstateManager
1003 * @param[out] state This may be set to an Error state if any error occurred processing them
1004 * @param[out] ppindex If set, the pointer will be set to point to the last new block index object for the given headers
1005 */
1006- bool ProcessNewBlockHeaders(const std::vector<CBlockHeader>& block, BlockValidationState& state, const CBlockIndex** ppindex = nullptr) LOCKS_EXCLUDED(cs_main);
1007+ bool ProcessNewBlockHeaders(const std::vector<CBlockHeader>& block, BlockValidationState& state, const bool min_pow_checked, const CBlockIndex** ppindex = nullptr) LOCKS_EXCLUDED(cs_main);
mzumsande
commented at 4:12 pm on August 22, 2022:
doc above needs an update, plus the ordering is no longer (in, in-out, out) per developer notes.
sdaftuar
commented at 0:30 am on August 23, 2022:
member
Pushed two commits to address the following:
If a peer has a chain whose work is less than minchainwork, but is a subset of the main chain (m_best_header or chainActive), then we wouldn’t ever process the peer’s headers and update pindexBestKnownBlock, preventing us from being able to use the peer as a download peer for the portion of the chain they have. Fix this by checking to see if the last header in a headers message is (a) known to us and (b) an ancestor of our best header or tip, and if so, bypass the DoS checks (which doesn’t cost us anything because processing these headers wouldn’t use more memory anyway).
This change means that for a node which has synced the honest chain, it’s no longer possible for an attacker to try to achieve a collision during REDOWNLOAD by feeding us headers we already have during PRESYNC. This also means that we would not initiate a low-work headers sync until receiving a headers message that contains something new, which I think is conceptually simpler than what was happening previously.
I noticed a bug in ProcessHeadersMessage() where, in a couple of places, we would be operating on the headers received off the network instead of the headers returned by the headers sync module. The second commit tries to clean this logic up in a less error-prone way by eliminating the headers_to_process pointer, in favor of just changing the headers vector directly when previously_downloaded_headers are returned.
Finally, I’m still contemplating whether any logic change is warranted if we receive a low-work header via compact block (right now we would ignore them; I think this may be fine, but since this is a behavior change it’s worth considering whether this could be problematic at all).
(Will squash these commits into place once reviewers have had a chance to take a look)
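(A rough sketch of that ancestor check, assuming a hypothetical helper; the actual commit may structure it differently:)

// If a known header lies on the chain leading to our best header or to our
// active tip, accepting it cannot add new block index entries, so the
// anti-DoS work checks can safely be bypassed for it.
bool IsAncestorOfBestHeaderOrTip(const CBlockIndex* header)
{
    for (const CBlockIndex* chain_tip : {m_chainman.m_best_header, m_chainman.ActiveTip()}) {
        if (chain_tip && chain_tip->GetAncestor(header->nHeight) == header) {
            return true;
        }
    }
    return false;
}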
fanquake
commented at 7:54 am on August 23, 2022:
member
@glozow The Qt signals framework can’t enqueue enum values, unfortunately. The alternative is creating some wrapper class etc, but that seemed overkill. I agree this is suboptimal, but I think cleaner alternatives aren’t worth the effort here.
The rule against difficulty adjustments changing by more than a factor of 4 can
be helpful for anti-DoS measures in contexts where we lack a full headers
chain, so expose this functionality separately and in the narrow case where we
only know the height, new value, and old value.
Includes fuzz test by Martin Zumsande.
1d4cfa4272
Add bitdeque, an std::deque<bool> analogue that does bit packing.
84852bb6bb
in
src/net_processing.cpp:2574
in
96140e9479outdated
glozow
commented at 4:48 pm on August 23, 2022:
member
Have stared at each commit enough to convince myself this is safe. Tested this a few times on mainnet and testnet with some sanity check assertions thrown in. I don’t know much about qt but the code looks correct, and when testing, both gui and debug.log seemed to display information at a reasonable cadence.
sdaftuar force-pushed
on Aug 23, 2022
sdaftuar force-pushed
on Aug 23, 2022
Add functions to construct locators without CChain
This introduces an insignificant performance penalty, as it means locator
construction needs to use the skiplist-based CBlockIndex::GetAncestor()
function instead of the lookup-based CChain, but avoids the need for
callers to have access to a relevant CChain object.
ed470940cd
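(For readers unfamiliar with locator construction, the skiplist-based approach looks roughly like this; a simplified sketch, not the commit verbatim:)

// Walk back from `index` with exponentially growing steps, using the
// skiplist-based CBlockIndex::GetAncestor() instead of a CChain height lookup.
std::vector<uint256> LocatorEntries(const CBlockIndex* index)
{
    int step{1};
    std::vector<uint256> have;
    while (index != nullptr) {
        have.emplace_back(index->GetBlockHash());
        if (index->nHeight == 0) break;
        // Jump back `step` blocks; GetAncestor() walks the skiplist, so this
        // is O(log n) rather than a direct O(1) CChain lookup.
        index = index->GetAncestor(std::max(index->nHeight - step, 0));
        // Keep the first 10 entries dense, then double the step each time.
        if (have.size() > 10) step *= 2;
    }
    return have;
}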
sdaftuar force-pushed
on Aug 23, 2022
sdaftuar
commented at 8:10 pm on August 23, 2022:
member
Addressed review feedback and included some changes to further clean up the logic in ProcessHeadersMessage(). Also, I included a new test for large reorg handling.
nit: Could an implicit bool to int conversion be avoided?
hebasto
commented at 11:45 am on August 24, 2022:
member
Tested bbef38fdac90a03020ad70cbe6b97fbf09fccbc6 on the GUI:
in
src/net_processing.cpp:598
in
1fca7ed49eoutdated
587@@ -581,18 +588,59 @@ class PeerManagerImpl final : public PeerManager
588589 void ProcessOrphanTx(std::set<uint256>& orphan_work_set) EXCLUSIVE_LOCKS_REQUIRED(cs_main, g_cs_orphans)
590 EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex);
591- /** Process a single headers message from a peer. */
592+ /** Process a single headers message from a peer.
593+ *
594+ * @param[in] pfrom CNode of the peer
595+ * @param[in] peer The peer sending us the headers
596+ * @param[in] headers The headers received. Note that this may be modified within ProcessHeadersMessage.
Hmm, interesting point. I think the intention is not to think of the headers in the vector after calling ProcessHeadersMessage as the output of the function, but rather just that the caller shouldn’t assume anything about what is in the vector. So I think it’s better to not have the docs say “[in,out]”, given that understanding?
With the recent changes in net_processing, I think it’s easier to think about it as “reference to a list of headers to be processed, and which may be modified”. All the caller would do with an output argument is overwrite the passed-in headers variable anyway.
Why modify it if you don’t want the caller to use it afterwards?
Is it only modified by the headers.swap(result.pow_validated_headers) in IsContinuationOfLowWorkHeadersSync() and all callers ignore the result? If yes, then that swap() can be removed and the parameters made const?
Edit: the callers of IsContinuationOfLowWorkHeadersSync() don’t ignore the result, I was confused, sorry. Maybe document what is expected as in and what is returned as out for the headers parameter in both PeerManagerImpl::ProcessHeadersMessage() and IsContinuationOfLowWorkHeadersSync().
@vasild For IsContinuationOfLowWorkHeadersSync, the comment currently says:
 * @param[in,out] headers The headers to be processed.
 * @return True if the passed in headers were successfully processed
 *         as the continuation of a low-work headers sync in progress;
 *         false otherwise.
 *         If false, the passed in headers will be returned back to
 *         the caller.
 *         If true, the returned headers may be empty, indicating
 *         there is no more work for the caller to do; or the headers
 *         may be populated with entries that have passed anti-DoS
 *         checks (and therefore may be validated for block index
 *         acceptance by the caller).
For ProcessHeadersMessage, the comment reads:
 * @param[in] headers The headers received. Note that this may be modified within ProcessHeadersMessage.
Can you clarify what documentation you think is missing?
The headers argument to ProcessHeadersMessage should probably be of type std::vector<CBlockHeader>&&, making the caller indicate that the referenced vector isn’t going to be used anymore. Callers can pass something in with std::move(headers). The compact block call site can turn into return ProcessHeadersMessage(pfrom, *peer, {cmpctblock.header}, /*via_compact_block=*/true); then even, without temporary headers variable.
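(For illustration, the suggested signature change would look something like this sketch:)

// Taking the vector by rvalue reference documents that the caller's copy is
// consumed; a temporary like {cmpctblock.header} then binds directly.
void ProcessHeadersMessage(CNode& pfrom, Peer& peer,
                           std::vector<CBlockHeader>&& headers,
                           bool via_compact_block);

// Call sites hand off ownership explicitly:
//   ProcessHeadersMessage(pfrom, *peer, std::move(headers), /*via_compact_block=*/false);
//   ProcessHeadersMessage(pfrom, *peer, {cmpctblock.header}, /*via_compact_block=*/true);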
sdaftuar
commented at 11:58 pm on August 25, 2022:
@sdaftuar, I was just briefly looking around and I missed the explanations in IsContinuationOfLowWorkHeadersSync()’s @return. The comment in ProcessHeadersMessage() is too scarce - “may be modified”, but in which way?
Anyway, too minor, and I guess @sipa’s suggestion above has improved it enough.
in
src/net_processing.cpp:616
in
1fca7ed49eoutdated
610 * occasional non-connecting header (this can happen due to BIP 130 headers
611 * announcements for blocks interacting with the 2hr (MAX_FUTURE_BLOCK_TIME) rule). */
612 void HandleFewUnconnectingHeaders(CNode& pfrom, Peer& peer, const std::vector<CBlockHeader>& headers);
613 /** Return true if the headers connect to each other, false otherwise */
614 bool CheckHeadersAreContinuous(const std::vector<CBlockHeader>& headers) const;
615+ /** Try to continue a low-work headers sync that has already begun.
1fca7ed49eca6c2cd71d37415fa2d419779cf212 Given that this calls ProcessNextHeaders() we should probably repeat the requirement for that function:
Assumes the caller has already verified the headers connect, and has checked that each header satisfies proof-of-work target included in the header
sdaftuar
commented at 11:23 pm on August 24, 2022:
Done.
in
src/net_processing.cpp:617
in
1fca7ed49eoutdated
614 bool CheckHeadersAreContinuous(const std::vector<CBlockHeader>& headers) const;
615+ /** Try to continue a low-work headers sync that has already begun.
616+ * @param[in] peer The peer we're syncing with.
617+ * @param[in] pfrom CNode of the peer
618+ * @param[in,out] headers The headers to be processed.
619+ * @return True if the passed in headers were successfully processed; false otherwise.
1fca7ed49eca6c2cd71d37415fa2d419779cf212 to avoid using the ambiguous term “processed” twice, maybe replace “were successfully processed” with “passed the checks in ValidateAndStoreHeadersCommitments” or “connect and their difficulty adjustment is within bounds”.
sdaftuar
commented at 11:23 pm on August 24, 2022:
Reworked this comment to be clearer.
in
src/headerssync.h:151
in
1fca7ed49eoutdated
137+ };
138+
139+ /** Process a batch of headers, once a sync via this mechanism has started
140+ *
141+ * received_headers: headers that were received over the network for processing.
142+ * Assumes the caller has already verified the headers connect,
1fca7ed49eca6c2cd71d37415fa2d419779cf212: ProcessNextHeaders calls ValidateAndStoreHeadersCommitments, which checks that headers connect. So does it really assume it?
That’s a fair point; I think what I was thinking here is that headers within a headers message that don’t connect aren’t going to be well-handled by this logic, because that is a level of brokenness that we’d want to detect at the net_processing level and just disconnect the peer (certainly it’s not a type of failure that we would handle by issuing another getheaders and hoping for a different response).
The logic does (or should!) correctly deal with a case where the first header in a message doesn’t connect with the last header in a prior message, which is not necessarily indicative of a broken peer (it could happen due to the peer reorging, for instance). In that type of situation, we want to return a failure to the caller that indicates the caller could try again with a new headers sync from a different starting point.
So in total, I think the idea that the caller should be checking for this error is still the right idea to get across, but perhaps this comment is slightly misleading.
headers within a headers message that don’t connect
I’ve been assuming that in the source code comments “connect” means the first header fits onto something we have, and “continuous” means all headers inside the package connect (with no forks?). For each function we should probably be clear which of the two we care about.
The logic does (or should!) correctly deal with a case where the first header in a message doesn’t connect with the last header in a prior message,
1fca7ed49eca6c2cd71d37415fa2d419779cf212 probably worth logging if this does go wrong.
sdaftuar
commented at 11:24 pm on August 24, 2022:
Done.
in
src/headerssync.h:166
in
1fca7ed49eoutdated
161+ /** Issue the next GETHEADERS message to our peer.
162+ *
163+ * This will return a locator appropriate for the current sync object, to continue the
164+ * synchronization phase it is in.
165+ */
166+ CBlockLocator MakeNextHeadersRequest() const;
1fca7ed49eca6c2cd71d37415fa2d419779cf212: NextHeadersRequestLocator might be a better name
sdaftuar
commented at 11:24 pm on August 24, 2022:
Agreed, done.
in
src/net_processing.cpp:622
in
1fca7ed49eoutdated
613 /** Return true if the headers connect to each other, false otherwise */
614 bool CheckHeadersAreContinuous(const std::vector<CBlockHeader>& headers) const;
615+ /** Try to continue a low-work headers sync that has already begun.
616+ * @param[in] peer The peer we're syncing with.
617+ * @param[in] pfrom CNode of the peer
618+ * @param[in,out] headers The headers to be processed.
1fca7ed49eca6c2cd71d37415fa2d419779cf212: is there any downside to just having a const headers_in and a separate headers_out param? Inside IsContinuationOfLowWorkHeadersSync you’re not editing headers in place anyway, using swap instead.
This was the result of a couple iterations that I went through with @sipa. I believe @sipa pointed out to me that we could push the headers manipulation down another level into the headers sync object itself, and then we could save an allocation – but I was going to save further refinements of this code to a future PR at this point.
in
src/net_processing.cpp:2659
in
1fca7ed49eoutdated
2656+ }
2657+
2658+ // Before we do any processing, make sure these pass basic sanity checks.
2659+ // We'll rely on headers having valid proof-of-work further down, as an
2660+ // anti-DoS criteria (note: this check is required before passing any
2661+ // headers/ into HeadersSyncState).
sdaftuar
commented at 11:25 pm on August 24, 2022:
Fixed.
in
src/net_processing.cpp:2775
in
1fca7ed49eoutdated
2674+ // If we're in the middle of headers sync, let it do its magic.
2675+ bool have_headers_sync = false;
2676+ {
2677+ LOCK(peer.m_headers_sync_mutex);
2678+
2679+ already_validated_work = IsContinuationOfLowWorkHeadersSync(peer, pfrom, headers);
1fca7ed49eca6c2cd71d37415fa2d419779cf212 This can be true if we’re still still in the PRESYNC state, even though we haven’t checked against the minimum work yet. In that case this function returns early, so it’s fine. But a clarifying comment would be good here.
sdaftuar
commented at 11:25 pm on August 24, 2022:
Updated the comments around this variable.
in
src/net_processing.cpp:2668
in
1fca7ed49eoutdated
2665 }
26662667 const CBlockIndex *pindexLast = nullptr;
26682669+ // We'll set already_validated_work to true if the headers-sync logic
2670+ // returns headers for us to process, to bypass the minimum work check
1fca7ed49eca6c2cd71d37415fa2d419779cf212 Maybe say “minimum total work check”, so it’s not confused with a minimal check on the work.
sdaftuar
commented at 11:26 pm on August 24, 2022:
Reworked this comment.
in
src/net_processing.cpp:2757
in
1fca7ed49eoutdated
2658+ // Before we do any processing, make sure these pass basic sanity checks.
2659+ // We'll rely on headers having valid proof-of-work further down, as an
2660+ // anti-DoS criteria (note: this check is required before passing any
2661+ // headers/ into HeadersSyncState).
2662+ if (!CheckHeadersPoW(headers, m_chainparams.GetConsensus(), peer)) {
2663+ Misbehaving(peer, 100, "invalid proof-of-work");
Will add. Also, you just made me realize that the anti-DoS work calculation to avoid processing compact blocks that are low-work doesn’t first verify the proof-of-work on the header before doing the work check. After giving this more thought, I don’t think it’s a problem, because if the claimed work is low we ignore it, and if the claimed work is high then we might try to validate it, which would catch any problems – but wanted to mention it in case I am overlooking something.
in
src/net_processing.cpp:4157
in
1fca7ed49eoutdated
4153@@ -3913,7 +4154,8 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
4154 // the peer if the header turns out to be for an invalid block.
4155 // Note that if a peer tries to build on an invalid chain, that
4156 // will be detected and the peer will be disconnected/discouraged.
4157- return ProcessHeadersMessage(pfrom, *peer, {cmpctblock.header}, /*via_compact_block=*/true);
4158+ std::vector<CBlockHeader> dummy{cmpctblock.header};
1fca7ed49eca6c2cd71d37415fa2d419779cf212: it’s not really a dummy is it? E.g. the proof of work gets checked by ProcessHeadersMessage()
sdaftuar
commented at 11:27 pm on August 24, 2022:
Fixed.
in
src/headerssync.cpp:19
in
1fca7ed49e (outdated)
14+//! Store a commitment to a header every HEADER_COMMITMENT_PERIOD blocks.
15+constexpr size_t HEADER_COMMITMENT_PERIOD{584};
16+
17+//! Only feed headers to validation once this many headers on top have been
18+//! received and validated against commitments.
19+constexpr size_t REDOWNLOAD_BUFFER_SIZE{13959};
1fca7ed49eca6c2cd71d37415fa2d419779cf212: so this doesn’t need to be a multiple of HEADER_COMMITMENT_PERIOD? Maybe mention that it’s approximately 24 commitment periods worth of headers, i.e. 24 bits of security (? Or ~9 more because an attacker has to account for every possible offset)
sdaftuar
commented at 11:27 pm on August 24, 2022:
I’m not sure “bits of security” is quite the right way to think about this, but I added a comment mentioning this is about 23.9 commitment bits that will fit in the buffer.
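For concreteness, the arithmetic behind that 23.9 figure, using the two constants from the snippet above; a back-of-envelope check, not code from the PR:

    // REDOWNLOAD_BUFFER_SIZE / HEADER_COMMITMENT_PERIOD commitment bits fit in
    // the buffer before any headers are released to validation.
    static_assert(13959 / 584 == 23, "23 full commitment periods fit in the buffer");
    // More precisely, 13959 / 584.0 ≈ 23.9 commitment bits.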
Not necessarily true I think, such as if we are destroying the object early (such as due to peer disconnection, or being in a unit test environment, etc).
Wouldn’t it be safer to always require a call to Finalize() in these cases? Alternatively, to have a State::NEW where we know nothing has been allocated yet.
I believe clear() does not actually free memory, but the swap() here does. (May not matter much since the caller will just be deleting us anyway, but seems better to free resources as soon as we can.)
This seems based on the (somewhat surprising) std::vector<T>::clear() behavior which does not release memory, just sets the size of the used area of it to 0. That doesn’t apply here though, because std::deque is not an std::vector and I don’t think it exhibits that behavior.
Also, even if it was, I think m_redownloaded_headers = std::deque<CompressedHeader>{}; is cleaner (since C++11 move semantics that’s equivalent to that swap statement, but a bit more readable).
Seems I’m wrong; std::deque<T>::clear() does not guarantee memory is released.
sdaftuar
commented at 11:45 pm on August 24, 2022:
Rewrote this as m_redownloaded_headers = std::deque<CompressedHeader>{};
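For readers following this exchange, a minimal self-contained sketch (not the PR’s code; the element type is a stand-in) of why move-assigning a fresh deque releases memory while clear() need not:

    #include <deque>

    struct CompressedHeaderStandIn { unsigned char data[48]; };

    void ReleaseMemoryExample(std::deque<CompressedHeaderStandIn>& dq)
    {
        dq.clear(); // size() becomes 0, but the implementation may retain the storage
        dq = std::deque<CompressedHeaderStandIn>{}; // move-assign a fresh deque: the old
        // representation is destroyed, so its memory is actually released
    }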
in
src/headerssync.cpp:90
in
1fca7ed49e (outdated)
85+ if (full_headers_message || m_download_state == State::REDOWNLOAD) {
86+ // A full headers message means the peer may have more to give us;
87+ // also if we just switched to REDOWNLOAD then we need to re-request
88+ // headers from the beginning.
89+ ret.request_more = true;
90+ } else {
1fca7ed49eca6c2cd71d37415fa2d419779cf212: calling Finalize() here is safe, because we’ve already popped the headers we need into ret.
in
src/headerssync.h:229
in
1fca7ed49e (outdated)
224+ *
225+ * Any peer giving us more headers than this will have its sync aborted. This serves as a
226+ * memory bound on m_header_commitments. */
227+ uint64_t m_max_commitments{0};
228+
229+ /** Store the latest header received while in PRESYNC */
1fca7ed49eca6c2cd71d37415fa2d419779cf212 or the hash of m_chain_start.
sdaftuar
commented at 11:27 pm on August 24, 2022:
Updated comment.
in
src/headerssync.h:184
in
1fca7ed49e (outdated)
179+ * On failure, this invokes Finalize() and returns false.
180+ */
181+ bool ValidateAndStoreHeadersCommitments(const std::vector<CBlockHeader>& headers);
182+
183+ /** In PRESYNC, process and update state for a single header */
184+ bool ValidateAndProcessSingleHeader(const CBlockHeader& previous, const
1fca7ed49eca6c2cd71d37415fa2d419779cf212: the previous argument is a bit pointless, since it’s a member variable.
sdaftuar
commented at 11:27 pm on August 24, 2022:
Eliminated the extra argument.
in
src/headerssync.h:243
in
1fca7ed49e (outdated)
238+ std::deque<CompressedHeader> m_redownloaded_headers;
239+
240+ /** Height of last header in m_redownloaded_headers */
241+ int64_t m_redownload_buffer_last_height{0};
242+
243+ /** Hash of last header in m_redownloaded_headers (we have to cache it
1fca7ed49eca6c2cd71d37415fa2d419779cf212 or the hash of m_chain_start.
sdaftuar
commented at 11:27 pm on August 24, 2022:
Updated comment.
in
src/headerssync.cpp:201
in
1fca7ed49e (outdated)
195+ // Add a commitment.
196+ m_header_commitments.push_back(m_hasher(current.GetHash()) & 1);
197+ if (m_header_commitments.size() > m_max_commitments) {
198+ // The peer's chain is too long; give up.
199+ // It's possible the chain grew since we started the sync; so
200+ // potentially we could succeed in syncing the peer's chain if we
1fca7ed49eca6c2cd71d37415fa2d419779cf212: in the case of IBD this is just a remote theoretical possibility, right? But for a shorter catchup there’s at least a less-remote chance someone produced lots of blocks with timestamps progressing only 1 second per 6 blocks, e.g. as part of an ongoing timewarp attack.
I think even in that scenario, this code ought to succeed, because we establish the minimum work requirement to leave this logic at the start of a sync, and not as we go. So if a peer had a high-enough-work chain at the start of our sync, then this calculation should correctly capture the number of valid headers they might have to satisfy that work requirement.
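A rough sketch of the bound under discussion, fixed at sync start as described above (the function and parameter names are illustrative, not the PR’s): it relies on the consensus rule that median-time-past must advance, which caps a valid chain at roughly 6 blocks per second of elapsed time, plus the two-hour allowance for future-dated timestamps.

    #include <cstdint>

    uint64_t MaxCommitmentsSketch(int64_t now, int64_t chain_start_mtp,
                                  int64_t max_future_block_time, // e.g. 2 hours
                                  uint64_t commitment_period)
    {
        // At most ~6 consensus-valid blocks per second of elapsed time.
        const int64_t max_blocks = 6 * (now - chain_start_mtp + max_future_block_time);
        return static_cast<uint64_t>(max_blocks) / commitment_period + 1;
    }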
in
src/headerssync.cpp:256
in
1fca7ed49e (outdated)
251+ // Also, don't check commitments once we've gotten to our target blockhash;
252+ // it's possible our peer has extended its chain between our first sync and
253+ // our second, and we don't want to return failure after we've seen our
254+ // target blockhash just because we ran out of commitments.
255+ if (!m_process_all_remaining_headers && next_height % HEADER_COMMITMENT_PERIOD == m_commit_offset) {
256+ bool commitment = m_hasher(header.GetHash()) & 1;
2022-08-25T14:41:14.204000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_zotpggyy
2022-08-25T14:41:14.791000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
  File "/home/sdaftuar/ccl-bitcoin/2022-02-headers-dos-prevention/test/functional/test_framework/test_framework.py", line 133, in main
    self.run_test()
  File "/home/sdaftuar/ccl-bitcoin/2022-02-headers-dos-prevention/test/functional/p2p_unrequested_blocks.py", line 105, in run_test
    assert_equal(self.check_hash_in_chaintips(self.nodes[1], blocks_h2[1].hash), False)
  File "/home/sdaftuar/ccl-bitcoin/2022-02-headers-dos-prevention/test/functional/test_framework/util.py", line 56, in assert_equal
    raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
AssertionError: not(True == False)
2022-08-25T14:41:14.842000Z TestFramework (INFO): Stopping nodes
2022-08-25T14:41:14.996000Z TestFramework (WARNING): Not cleaning up dir /tmp/bitcoin_func_test_zotpggyy
2022-08-25T14:41:14.996000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/bitcoin_func_test_zotpggyy/test_framework.log
2022-08-25T14:41:14.996000Z TestFramework (ERROR):
2022-08-25T14:41:14.996000Z TestFramework (ERROR): Hint: Call /home/sdaftuar/ccl-bitcoin/2022-02-headers-dos-prevention/test/functional/combine_logs.py '/tmp/bitcoin_func_test_zotpggyy' to consolidate all logs
2022-08-25T14:41:14.996000Z TestFramework (ERROR):
2022-08-25T14:41:14.997000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
2022-08-25T14:41:14.997000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
2022-08-25T14:41:14.997000Z TestFramework (ERROR):
in
src/net_processing.cpp:4436
in
01a9c4dc28 (outdated)
4345 @@ -4336,8 +4346,14 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
4346 // which peers send us compact blocks, so the race between here and
4347 // cs_main in ProcessNewBlock is fine.
4348 mapBlockSource.emplace(hash, std::make_pair(pfrom.GetId(), true));
4349+
4350+ // Check work on this block against our anti-dos thresholds.
4351+ const CBlockIndex* prev_block = m_chainman.m_blockman.LookupBlockIndex(pblock->hashPrevBlock);
4352+ if (prev_block && prev_block->nChainWork + CalculateHeadersWork({pblock->GetBlockHeader()}) >= GetAntiDoSWorkThreshold()) {
4353+ min_pow_checked = true;
01a9c4dc2869708957ce3d9d4a391740e074363c: so in the else case min_pow_checked would remain false and thus trigger the BLOCK_HEADER_LOW_WORK condition, can we cover that in a test?
Sjors
commented at 9:15 am on August 25, 2022:
member
01a9c4dc2869708957ce3d9d4a391740e074363c: I’m not sure how robust the min_pow_checked “attestation” is against breaking things in the future (it looks correct now). It’s important to have test coverage for low-work headers trying to sneak in via the various call sites. This commit adds a few, so that’s good.
in
src/headerssync.h:124
in
bb60d93657 (outdated)
120 @@ -121,6 +121,12 @@ class HeadersSyncState {
121 /** Return the height reached during the PRESYNC phase */
122 int64_t GetPresyncHeight() const { return m_current_height; }
123
124+ /** Return the block timestamp of the last block received during the PRESYNC phase. */
Sjors
commented at 11:51 am on August 25, 2022:
member
tACK acee94b4fdb1a28db1c217d82ca0611d7ba5a56d
did not review the deque implementation and fuzzer code
did not check the actual per peer memory usage
only lightly reviewed the stats logging code in bb60d93657df10702a48ea3094d2eee732a80e52
sdaftuar
commented at 3:23 pm on August 25, 2022:
member
Addressed latest review feedback from @sjors, unsquashed branch is here for comparison.
sdaftuar force-pushed
on Aug 25, 2022
mzumsande approved
mzumsande
commented at 6:26 pm on August 25, 2022:
contributor
ACK bae408fc87b3111cf93d26db66f1947430b04f34
I reviewed the code (minus the bitdeque implementation, where I just ran the extensive fuzz test for a while) and tested this multiple times on mainnet, signet and testnet, both with bitcoind and the GUI.
in
src/net_processing.cpp:2489
in
0f24c9a924 (outdated)
2484+ // process the headers using it as normal.
2485+ return IsContinuationOfLowWorkHeadersSync(peer, pfrom, headers);
2486+ } else {
2487+ LogPrint(BCLog::NET, "Ignoring low-work chain (height=%u) from peer=%d\n", chain_start_header->nHeight + headers.size(), pfrom.GetId());
2488+ // Since this is a low-work headers chain, no further processing is required.
2489+ std::vector<CBlockHeader>().swap(headers);
sdaftuar
commented at 11:59 pm on August 25, 2022:
Done.
in
src/headerssync.cpp:54
in
0f24c9a924 (outdated)
49+ * required to guarantee that we won't reuse this object with the same
50+ * SaltedTxidHasher for another sync. */
51+void HeadersSyncState::Finalize()
52+{
53+ Assume(m_download_state != State::FINAL);
54+ m_header_commitments.clear();
If you want to guarantee memory is released here, use m_header_commitments = {}; as well.
sdaftuar
commented at 11:59 pm on August 25, 2022:
Done.
in
src/net_processing.cpp:2552
in
bae408fc87 (outdated)
2547+ // Note: we could advance to the last header in this set that is
2548+ // known to us, rather than starting at the first header (which we
2549+ // may already have). But we actually might have all the headers in
2550+ // this set already, because headers sync can be imprecise when a
2551+ // peer has to serve us a long chain (due to imprecision in the way
2552+ // locators are calculated). So philosophically, if we wanted to
I think this comment is outdated now. We’re actually not starting low-work headers sync until the peer gives a headers message which contains at least one header not in m_best_header/m_active_chain already.
1003 @@ -991,10 +1004,11 @@ class ChainstateManager
1004 *
1005 * @param[in] block The block we want to process.
1006 * @param[in] force_processing Process this block even if unrequested; used for non-network block sources.
1007+ * @param[in] min_pow_checked True if proof-of-work anti-DoS checks have been done by caller for headers chain
Maybe point out that this only affects situations where a new header is being added along with the block. If we already had the header, this parameter has no effect.
sdaftuar
commented at 11:59 pm on August 25, 2022:
Done.
sipa
commented at 9:18 pm on August 25, 2022:
member
ACK bae408fc87b3111cf93d26db66f1947430b04f34, after squashing (with the caveat that some of the code changes are mine).
I’ve re-reviewed all commits, and separately reviewed code changes since my last review (by comparing with a rebased version of the old commit hash). I’ve also done several more synchronizations from scratch, on mainnet and testnet.
Only non-blocking nits below.
sdaftuar force-pushed
on Aug 25, 2022
sdaftuar force-pushed
on Aug 26, 2022
sdaftuar
commented at 0:07 am on August 26, 2022:
member
Updated to address @sipa’s latest comments; unsquashed version is here.
sipa
commented at 3:57 am on August 26, 2022:
member
ACK f0dae7c6fa54d4d00489da02eebfa3e4ef40cb63 (for the parts that weren’t authored by me).
Doing a testnet sync, the headerssync logs look like:
2022-08-26T04:50:14Z [headerssync] Initial headers sync started with peer=0: height=0, max_commitments=3748027, min_work=00000000000000000000000000000000000000000000064728c7be6fe4b2f961
2022-08-26T04:52:45Z [headerssync] Initial headers sync started with peer=3: height=0, max_commitments=3748029, min_work=00000000000000000000000000000000000000000000064728c7be6fe4b2f961
2022-08-26T04:56:48Z [headerssync] Initial headers sync transition with peer=0: reached sufficient work at height=2144000, redownloading from height=0
2022-08-26T04:58:35Z [headerssync] Initial headers sync transition with peer=3: reached sufficient work at height=2144000, redownloading from height=0
2022-08-26T05:02:25Z [headerssync] Initial headers sync complete with peer=0: releasing all at height=2144000 (redownload phase)
which isn’t very spammy, so I think it would make sense to merge this back into BCLog::NET in a followup. (The main rationale being that if you’re trying to debug weird p2p behaviour, it’s better to see what’s going on with header syncing in case that happens to be the cause, without having to realise the possibility after the fact and then start debugging again after adding the extra debug category.)
in
src/net_processing.cpp:2791
in
9a0d196a9c (outdated)
2706+ // or not.
2707+ if (headers.empty()) {
2708+ return;
2709+ }
2710+
2711+ have_headers_sync = !!peer.m_headers_sync;
Not quite sure where to add this note; but having dropped the optimisation, this results in slightly weird behaviour: if you’re doing headers sync with two peers, A and B, and A completes redownload and then syncs up to the tip, then A, B and all other nodes will report synced_headers as matching the tip, and all nodes will start downloading blocks, but B will continue to download every header until it re-confirms that min work has been reached. Adding m_redownload_buffer_last_height to getpeerinfo makes this behaviour more easily observable.
Probably not worth optimising for this particular case rather than doing something more general and with greater impact, though.
in
src/net_processing.cpp:2542
in
f0dae7c6fa (outdated)
2537+ // before we'll store it)
2538+ arith_uint256 minimum_chain_work = GetAntiDoSWorkThreshold();
2539+
2540+ // Avoid DoS via low-difficulty-headers by only processing if the headers
2541+ // are part of a chain with sufficient work.
2542+ if (total_work < minimum_chain_work) {
Perhaps add an exception here for peers with some NetPermissionsFlag set (reuse noban or download?) to allow you to easily avoid downloading headers twice if you have a trusted node to connect to?
ajtowns
commented at 3:52 pm on August 26, 2022:
contributor
utACKf0dae7c6fa54d4d00489da02eebfa3e4ef40cb63 – logic review more than code review
sdaftuar
commented at 6:27 pm on August 26, 2022:
member
@ajtowns I think both of your suggestions – dropping the HEADERSSYNC log category in favor of using NET, and dropping the anti-DoS check for NoBan peers – make sense, but I think both of those improvements are minor at this point and I’d prefer to postpone to a followup PR, just to minimize additional review burden for this.
Creating a checklist so that these suggestions don’t get forgotten:
I wonder if it wouldn’t be better to make m_max_commitments a function of the time at which we evaluate this check (e.g. GetMaxCommitments(NodeClock::time_point_now)), rather than of when we initiate headers sync. One could argue that the most-work consensus-valid chain could be ignored by this logic, as the chain might have grown since the start of the sync and be artificially sorted out by DoS mitigations. I concede I don’t think it matters in practice, as if we reach the rate of 6 blocks/second something is already wrong; however, I think ideally the consensus outcome should be abstracted out from our headers/blocks syncing strategy.
Is there a plausible scenario where this would matter? We also specify at the start of sync the minimum work required; fixing the maximum chain length just has the effect of requiring that a peer’s chain has enough work at the time the sync is initiated to pass anti-DoS checks.
Failure to sync because of a low-work chain will not cause a peer to be dropped, so if their chain grows later we’ll try syncing again. And in the specific case of a peer announcing many blocks to us (like approaching the 6 blocks/second theoretical limit), this should not take long at all, because as soon as the sync ends and we get one more block announcement, we’ll resume headers sync.
As far as I can tell, I don’t see a plausible scenario where it matters. Ultimately, the historical chain has never come close to the 6 blocks/second rate, so the bound effectively includes an additional chain-length buffer that absorbs any discrepancy between the HeadersSyncState initialization time point and the present, I think. From my understanding, the maximum chain length is a protection against memory DoS of m_header_commitments, enforcing a constraint on the number of blocks allowed to satisfy the minimum work required.
I can’t imagine this matters, because it would require the peer to have a chain which is invalid at the time the sync starts, which becomes valid while we’re busy learning about it. If that were the case, we’ll still synchronize it later.
in
src/net_processing.cpp:2374
in
f0dae7c6fa (outdated)
2369+ }
2370+
2371+ // Are these headers connected to each other?
2372+ if (!CheckHeadersAreContinuous(headers)) {
2373+ Misbehaving(peer, 20, "non-continuous headers sequence");
2374+ return false;
I’m not sure why a misbehaving score of 20 is assigned here, considering that if CheckHeadersPoW returns false in ProcessHeadersMessage(), an additional misbehaving score of 100 is assigned there (L2757), which together goes beyond DISCOURAGEMENT_THRESHOLD. I don’t know if it makes sense.
sdaftuar
commented at 12:02 pm on August 29, 2022:
Thanks, I think this may have been an oversight when I added the Misbehaving() to the caller. Looks like the score of 20 is how things worked historically; I guess I’ll just leave that unchanged but remove the Misbehaving() from the caller.
in
src/headerssync.cpp:263
in
f0dae7c6fa (outdated)
258+ LogPrint(BCLog::HEADERSSYNC, "Initial headers sync aborted with peer=%d: commitment overrun at height=%i (redownload phase)\n", m_id, next_height);
259+ // Somehow our peer managed to feed us a different chain and
260+ // we've run out of commitments.
261+ return false;
262+ }
263+ bool commitment = m_hasher(header.GetHash()) & 1;
Considering the 1-bit commitment, I wonder how hard it would be for an attacker to feed us a high-work chain of headers during PRESYNC, to substitute a second low-work chain of headers during REDOWNLOAD satisfying the exact same commitment pattern.
From browsing the simulation, the 1-bit commitment is assumed as a system parameter at the beginning of the demonstration, from which the commitment parameters like frequency of commitments and size of the redownload buffer are derived. I wonder if an N-bit commitment could be more information-efficient, though we’re free to explore that again in the future if the commitment storage memory usage becomes an issue (probably not).
We’ve analyzed half a dozen variations before ending up with this approach, including:
(1) Just (2000/N) bits every 2000 headers
(2) 1 bit every N headers, with a fixed offset
(3) 1 bit every N headers, with a random offset (approach from this PR)
(4) 1 bit every N headers, with an offset chosen such that it minimizes the attacker’s power
(5) A 1 in N probability for every header to have a 1-bit commitment
(6) A 1/N-bit commitment for every header, using an entropy coder.
(1) and (2) are equivalent, as we’re only going to verify commitments after blobs of 2000 anyway. (3) is better, and destroys the equivalence with (1) because the attacker doesn’t know when to start. (4) is very slightly worse than (3), but still significantly better than (1) and (2). (5) and (6) are significantly worse.
If you have other ideas for commitment structures, I’m happy to try simulating them.
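To make scheme (3) concrete, a self-contained toy sketch; the 1-bit hash here is a trivial stand-in for the salted SipHash used in the PR, and all names are illustrative:

    #include <array>
    #include <cstdint>
    #include <deque>
    #include <random>

    using Hash256 = std::array<uint8_t, 32>; // stand-in for a block hash

    struct CommitmentSketch {
        static constexpr int64_t PERIOD{584};
        const uint64_t salt{std::random_device{}()};  // per-sync secret salt
        const int64_t offset{static_cast<int64_t>(std::random_device{}() % PERIOD)}; // secret offset
        std::deque<bool> bits;

        // PRESYNC: store one bit at heights matching the secret offset.
        void MaybeCommit(int64_t height, const Hash256& hash) {
            if (height % PERIOD == offset) bits.push_back(((hash[0] ^ salt) & 1) != 0);
        }

        // REDOWNLOAD: check the same bit; a mismatch means the peer substituted
        // a different chain between the two download phases.
        bool Verify(int64_t height, const Hash256& hash) {
            if (height % PERIOD != offset) return true;
            if (bits.empty()) return false; // commitment overrun
            const bool expected = bits.front();
            bits.pop_front();
            return expected == (((hash[0] ^ salt) & 1) != 0);
        }
    };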
ariard
commented at 2:38 am on August 29, 2022:
member
Will still review post-merge if this PR is evaluated as mature enough to land soon.
Utilize anti-DoS headers download strategy
Avoid permanently storing headers from a peer, unless the headers are part of a
chain with sufficiently high work. This prevents memory attacks using low-work
headers.
Designed and co-authored with Pieter Wuille.
551a8d957c
Require callers of AcceptBlockHeader() to perform anti-dos checks
In order to prevent memory DoS, we must ensure that we don't accept a new
header into memory until we've performed anti-DoS checks, such as verifying
that the header is part of a sufficiently high work chain. This commit adds a
new argument to AcceptBlockHeader() so that we can ensure that all call-sites
which might cause a new header to be accepted into memory have to grapple with
the question of whether the header is safe to accept, or needs further
validation.
This patch also fixes two places where low-difficulty-headers could have been
processed without such validation (processing an unrequested block from the
network, and processing a compact block).
Credit to Niklas Gögge for noticing this issue, and thanks to Sjors Provoost
for test code.
ed6cddd98e
Reduce spurious messages during headers sync
Delay sending SENDHEADERS (BIP 130) message until we know our peer's best
header's chain has more than nMinimumChainWork. This reduces inadvertent
headers messages received during initial headers sync due to block
announcements, which throw off our sync algorithm.
83c6a0c524
Add unit test for HeadersSyncState 0b6aa826b5
Test headers sync using minchainwork threshold 150a5486db
Expose HeadersSyncState::m_current_height in getpeerinfo() 03712dddfb
Track headers presync progress and log it 355547334f
Test large reorgs with headerssync logic 93eae27031
Make validation interface capable of signalling header presync
This makes a number of changes:
- Get rid of the verification_progress argument in the node interface
NotifyHeaderTip (it was always 0.0).
- Instead of passing a CBlockIndex* in the UI interface's NotifyHeaderTip,
send separate height, timestamp fields. This is because in headers presync,
no actual CBlockIndex object is available.
- Add a bool presync argument to both of the above, to identify signals
pertaining to the first headers sync phase.
376086fc5a
Emit NotifyHeaderTip signals for pre-synchronization progress 738421c50f
ui: show header pre-synchronization progress 3add234546
sdaftuar force-pushed
on Aug 29, 2022
sdaftuar
commented at 12:13 pm on August 29, 2022:
member
Updated to drop an extra Misbehaving() call that was unnecessary; unsquashed version for comparing against prior branch is here.
Sjors
commented at 4:53 pm on August 29, 2022:
member
Agree that merging the new log into NET makes sense. In fact it’s useful to look at both of them, e.g. to see that sendheaders is sent at the right time.
mzumsande
commented at 8:06 pm on August 29, 2022:
contributor
re-ACK 3add23454624c4c79c9eebc060b6fbed4e3131a7
sipa
commented at 11:38 am on August 30, 2022:
member
re-ACK 3add23454624c4c79c9eebc060b6fbed4e3131a7
glozow
commented at 12:33 pm on August 30, 2022:
member
ACK 3add234546
I can’t say I have a complete mental model of every possible interaction during IBD, end of IBD, reorgs, etc., but whiteboarded the syncing logic to the best of my ability. The comments are very helpful, and I reviewed the code to verify that it’s doing what the comments say it does.
Tested on {mainnet, signet, testnet} {with, without} gui to manually check getpeerinfo “presynced_headers,” logging, ui
Reviewed the code, a few notes if they help anyone. I did look at the bitdeque, though lightly. Fuzz looks pretty comprehensive to me.
in
src/validation.cpp:3721
in
355547334foutdated
3716+ AssertLockNotHeld(cs_main);
3717+ const auto& chainstate = ActiveChainstate();
3718+ {
3719+ LOCK(cs_main);
3720+ // Don't report headers presync progress if we already have a post-minchainwork header chain.
3721+ // This means we lose reporting for potentially legimate, but unlikely, deep reorgs, but
instagibbs
commented at 1:58 pm on August 30, 2022:
legitimate
instagibbs
commented at 2:09 pm on August 30, 2022:
member
Logging looks great!
edit: Gah, it just took my node forever to find a peer to give me all the headers after a disconnect I guess. peerinfo is showing a new peer is almost caught up and it should start printing again…
edit2: everything normal and logging continuing as expected by the code. All good here.
I guess the note is that if you have a flapping connection during presync, you’ll waste some bandwidth and time.
fanquake merged this
on Aug 30, 2022
fanquake closed this
on Aug 30, 2022
sidhujag referenced this in commit
c19a7c43d0
on Aug 30, 2022
in
src/pow.cpp:102
in
3add234546
97+
98+ // Round and then compare this new calculated value to what is
99+ // observed.
100+ arith_uint256 maximum_new_target;
101+ maximum_new_target.SetCompact(largest_difficulty_target.GetCompact());
102+ if (maximum_new_target < observed_new_target) return false;
I think a comment could be added that the last received header has been detected to be consensus-invalid, as the new target is above what is allowed by the difficulty adjustment bounds in CalculateNextWorkRequired(). Same for the lower bound, if my understanding is correct.
headerssync.cpp logs this case with an “invalid difficulty transition” message.
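For intuition, a simplified sketch of the bound this check enforces; the real code works on 256-bit targets and additionally accounts for compact-encoding rounding, as the snippet above shows:

    // At a difficulty adjustment, the actual timespan used is clamped to
    // [target_timespan/4, target_timespan*4], so a valid new target can differ
    // from the old target by at most a factor of 4 in either direction.
    bool WithinFactorOfFourSketch(double old_target, double new_target)
    {
        return new_target <= old_target * 4.0 && new_target >= old_target / 4.0;
    }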
in
src/pow.h:30
in
3add234546
25+ * given height is not possible, given the proof-of-work on the prior block as
26+ * specified by old_nbits.
27+ *
28+ * This function only checks that the new value is within a factor of 4 of the
29+ * old value for blocks at the difficulty adjustment interval, and otherwise
30+ * requires the values to be the same.
Minor comment: while this function is only used for p2p DoS protection, and the difficulty adjustment bounds checked here should reject invalid headers, this distinction could be made more precise.
Bitcoin Core is used as reference software by many bitcoin libraries across the ecosystem, and there is often confusion between what is policy, what is consensus, and what is an optimization based on historical consensus state like BIP90.
in
src/util/bitdeque.h:18
in
84852bb6bboutdated
13+#include <tuple>
14+
15+/** Class that mimics std::deque<bool>, but with std::vector<bool>'s bit packing.
16+ *
17+ * BlobSize selects the (minimum) number of bits that are allocated at once.
18+ * Larger values reduce the asymptotic memory usage overhead, at the cost of
I think it could be valuable to document which kind of workload this container is designed to hold, or whether it is independent of the element-insertion usage pattern. It could also be made explicit whether the implementation was written from scratch against the standard library specification, or inspired by an existing implementation, to ease maintenance if there are relevant changes upstream. We do so in LDK.
Note, I think it would be valuable to document the interactions between ProcessNewBlockHeaders() and pindexLast. The method documentation says “ppindex If set, the pointer will be set to point to the last new block index object for the given headers”. At that stage, L2509 (net_processing.cpp), I understand that pindexLast should always be set, as the unset case happens only if accepted=false, L3696 (validation.cpp); afaict accepted=false implies that BlockValidationState is invalid from AcceptBlockHeader, and therefore we should return, L2506 (validation.cpp). If there is a refactor of the chain of nested calls and their return values, we could have a silent break.
Do we have cases where a) accepted=false, b) BlockValidationState is valid and c) pindexLast unset? I don’t think so.
ariard
commented at 3:05 am on August 31, 2022:
member
Ongoing review commit by commit until 551a8d957. When done, will do another protocol-wise.
vasild
commented at 6:45 am on August 31, 2022:
contributor
Coverage report for modified or added lines by this PR and not covered by unit or functional tests.
in
test/functional/p2p_headers_sync_with_minchainwork.py:107
in
3add234546
jonatack
commented at 10:02 am on August 31, 2022:
Suggested test assertion
--- a/test/functional/p2p_headers_sync_with_minchainwork.py
+++ b/test/functional/p2p_headers_sync_with_minchainwork.py
@@ -104,6 +104,9 @@ class RejectLowDifficultyHeadersTest(BitcoinTestFramework):
     new_blocks.append(block)
     hashPrevBlock = block.sha256

+    # no low-work sync is in progress
+    assert_equal(node.getpeerinfo()[0]['presynced_headers'], -1)
+
     headers_message = msg_headers(headers=new_blocks)
     p2p.send_and_ping(headers_message)
in
src/headerssync.h:67
in
551a8d957c (outdated)
62+ * something we know. If a peer's chain has more than 2000 blocks, then we need
63+ * a way to verify that the chain actually has enough work on it to be useful to
64+ * us -- by being above our anti-DoS minimum-chain-work threshold -- before we
65+ * commit to storing those headers in memory. Otherwise, it would be cheap for
66+ * an attacker to waste all our memory by serving us low-work headers
67+ * (particularly for a new node coming online for the first time).
So I was wondering how our overall headers-sync logic would perform against potential best-chain blinding attacks, where an adversary with strong hashrate capabilities feeds us a quasi-unlimited number of minimum-chain-work headers chains to prevent the node from ever discovering the most-work valid chain (or at least significantly delay it), as we would continually iterate through the “fake” chain space.
During IBD, we sync from a single peer (L5301, in net_processing.cpp), and for each new inv block received we’ll do initial-sync headers with one more peer (L3621, in net_processing.cpp). So assuming the first peer picked is controlled by the adversary and feeds us one element of the “fake” chain space, each new block gives us a chance to escape the space by syncing with an honest peer. That said, the adversary might control a number of inbound/outbound slots, each granting one more ~10 min of blinding (I believe the criteria at L3621 on which we pick the one-more peer is falsifiable, as it’s just an inv(block)).
Once out of IBD, we’ll send a GETHEADERS to all our peers (L5312, in net_processing.cpp), so this blinding attack surface is reduced.
I think nMinimumChainWork puts a high bar in terms of the hashrate capabilities required; moreover, our peer selection strategy would also need to be bypassed, or a high number of puppet nodes deployed by the adversary, AND the first preferred peer we select for initial headers sync would have to be controlled by the attacker.
If this issue is concerning, I think we could require that the “one-more” initial-headers-sync peers added by 05f7f31598 be fPreferredDownload (outbound or blessed peers) as a slight hardening.
in
src/headerssync.h:216
in
551a8d957c (outdated)
211+ const SaltedTxidHasher m_hasher;
212+
213+ /** A queue of commitment bits, created during the 1st phase, and verified during the 2nd. */
214+ bitdeque<> m_header_commitments;
215+
216+ /** The (secret) offset on the heights for which to create commitments.
I wonder what is meant here by “(secret)”, i.e. whether it should be interpreted as the offset not being known or observable by our peers. As the period is a software-defined value (HEADER_COMMITMENT_PERIOD) in this module, I don’t know if there is a secrecy goal achieved.
The period is not secret, the offset is secret and non-observable (until it is too late for the attacker).
in
src/headerssync.cpp:224
in
551a8d957c (outdated)
219+
220+ int64_t next_height = m_redownload_buffer_last_height + 1;
221+
222+ // Ensure that we're working on a header that connects to the chain we're
223+ // downloading.
224+ if (header.hashPrevBlock != m_redownload_buffer_last_hash) {
Note, one of the requirements of ValidateAndStoreRedownloadedHeader is that the caller check the continuity of the headers chain (“Assumes the caller has already verified the headers are continuous”). This requirement is enforced at L2372 in net_processing.cpp, so I wonder whether this check can effectively be hit, and whether it is redundant.
ariard
commented at 2:59 am on September 1, 2022:
member
Still reviewing 551a8d95
hebasto referenced this in commit
f79d612fba
on Sep 1, 2022
in
src/headerssync.cpp:142
in
3add234546
137+}
138+
139+bool HeadersSyncState::ValidateAndStoreHeadersCommitments(const std::vector<CBlockHeader>& headers)
140+{
141+ // The caller should not give us an empty set of headers.
142+ Assume(headers.size() > 0);
255+ // target blockhash just because we ran out of commitments.
256+ if (!m_process_all_remaining_headers && next_height % HEADER_COMMITMENT_PERIOD == m_commit_offset) {
257+ if (m_header_commitments.size() == 0) {
258+ LogPrint(BCLog::HEADERSSYNC, "Initial headers sync aborted with peer=%d: commitment overrun at height=%i (redownload phase)\n", m_id, next_height);
259+ // Somehow our peer managed to feed us a different chain and
260+ // we've run out of commitments.
Here we could invite the node operator to submit a bug report, as a peer effectively managing to feed us a different low-work chain and exhaust our commitments would be a data point that a) this anti-DoS headers sync mitigation can be falsified and b) we have adversaries in the wild with hashing capability actively targeting nodes.
in
src/net_processing.cpp:388
in
3add234546
381 @@ -381,6 +382,15 @@ struct Peer {
382 /** Time of the last getheaders message to this peer */
383 NodeClock::time_point m_last_getheaders_timestamp{};
384
385+ /** Protects m_headers_sync **/
386+ Mutex m_headers_sync_mutex;
387+ /** Headers-sync state for this peer (eg for initial sync, or syncing large
388+ * reorgs) **/
IIUC, syncing large reorgs is defined here as receiving chain headers that fork more than the 144-block buffer documented in GetAntiDoSWorkThreshold(). Any fork under that threshold should have its headers accepted without the anti-DoS headers mechanism playing out.
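A simplified sketch of the threshold described above; work values are collapsed to 64-bit integers here, whereas Bitcoin Core uses arith_uint256, and the name is illustrative:

    #include <algorithm>
    #include <cstdint>

    uint64_t AntiDoSWorkThresholdSketch(uint64_t tip_chain_work,
                                        uint64_t tip_block_proof, // work of one block at the tip
                                        uint64_t min_chain_work)
    {
        // Accept headers chains within ~144 blocks' worth of work of our tip,
        // or with at least the hardcoded minimum chain work, whichever is larger.
        const uint64_t buffer = std::min(144 * tip_block_proof, tip_chain_work);
        return std::max(tip_chain_work - buffer, min_chain_work);
    }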
ariard
commented at 6:11 pm on September 1, 2022:
member
Good with commit 551a8d957
in
src/net_processing.cpp:878
in
ed6cddd98e (outdated)
874 @@ -875,7 +875,7 @@ class PeerManagerImpl final : public PeerManager
875 EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex, peer.m_getdata_requests_mutex) LOCKS_EXCLUDED(::cs_main);
876
877 /** Process a new block. Perform any post-processing housekeeping */
878- void ProcessBlock(CNode& node, const std::shared_ptr<const CBlock>& block, bool force_processing);
879+ void ProcessBlock(CNode& node, const std::shared_ptr<const CBlock>& block, bool force_processing, bool min_pow_checked);
A rationale for the conditions under which min_pow_checked can be set to true could be added. E.g., when anti-DoS checks have been carried out, as in compact block processing, or when the block has been submitted by an RPC caller (as in generateblock() or submitheader()). Or replicate the comments added for AcceptBlockHeader(), ProcessNewBlock(), ProcessNewBlockHeaders().
in
src/net_processing.cpp:5041
in
83c6a0c524 (outdated)
5032 @@ -5034,6 +5033,27 @@ void PeerManagerImpl::MaybeSendAddr(CNode& node, Peer& peer, std::chrono::micros
5033 }
5034 }
5035
5036+void PeerManagerImpl::MaybeSendSendHeaders(CNode& node, Peer& peer)
5037+{
5038+ // Delay sending SENDHEADERS (BIP 130) until we're done with an
5039+ // initial-headers-sync with this peer. Receiving headers announcements for
5040+ // new blocks while trying to sync their headers chain is problematic,
5041+ // because of the state tracking done.
IIUC the rationale: absent this delay of sending SENDHEADERS, we could have duplicated headers-sync with a peer. One, with HEADERS reception triggering continuous GETHEADERS in ProcessHeadersMessage() (L2854). A second, when we receive an INV(block_hash), via the “1 new headers-sync peer every new block” logic (L3622).
Note, delaying sending SENDHEADERS might interfere with old versions; at least BIP130 mentions “As support for sendheaders is optional, software that implements this may also optionally impose additional constraints, such as only honoring sendheaders messages shortly after a connection is established”.
Even if it’s a validation-related log, the peer id could be provided to indicate the headers source, in case of anomaly detection.
ariard
commented at 6:21 pm on September 2, 2022:
member
Reviewed until tip commit 3add2345
ariard
commented at 2:33 am on September 3, 2022:
member
Post-Merge ACK 3add2345
Reviewed the code, the HeadersSyncState state machine correctness, the soundness of the anti-DoS mechanism design, and partially the test coverage.
From my understanding, the new bitdeque<> m_header_commitments has been deliberately designed to avoid memory DoS, and it should scale well if we had more outbound full-relay/block-relay-only peers in the future.
The commitment scheme relies on SipHash, which has been deliberately designed for performance, and combined with m_max_commitments this should avoid new CPU DoS.
I don’t think the commitment scheme could be abused for node fingerprinting, as even though the random offset is unique per node, it should be cleaned up once the peer is disconnected, and as such is non-observable.
This anti-DoS mechanism should not introduce any message amplification attacks, as we increase neither the volume nor the frequency of our p2p messages based on the result of those anti-DoS checks; nor should it introduce a network-partition vector, as we stop processing on invalid/non-continuous headers.
domob1812 referenced this in commit
047aebfeda
on Sep 5, 2022
fresheneesz
commented at 3:04 pm on September 7, 2022:
none
If the primary attack being prevented is on memory usage, why require downloading the headers twice? Why not simply ensure the headers are removed from memory and store them on disk for later access? Easy enough to delete them if it turns out not enough PoW can be accumulated on that chain.
The larger question I have is: why are we putting in effort to remove the IBD checkpoint? The checkpoint is effective, efficient, and secure. “Confusion about checkpoints” is an absolutely terrible reason to remove them. Technical design decisions should not be made because some people are confused about “the role” effective solutions play. Please do not remove the checkpoint; it’s a critical piece of efficient IBD.
instagibbs
commented at 3:17 pm on September 7, 2022:
member
Why not simply ensure the headers are removed from memory and store them on disk for later access?
“simply” generally is not as simple as we’d like, and doesn’t solve disk-filling attacks
Please do not remove the checkpoint, its a critical piece of efficient IBD.
AFAIK it doesn’t speed up sync in non-adversarial situations; that would be the “assumevalid”-type parameters, which aren’t being discussed for removal.
sdaftuar
commented at 3:26 pm on September 7, 2022:
member
If the primary attack being prevented is on memory usage, why require downloading the headers twice? Why not simply ensure the headers are removed from memory and store them on disk for later access? Easy enough to delete them if it turns out not enough PoW can be accumulated on that chain.
To expand on what @instagibbs wrote, this would turn a memory attack into a disk-filling attack. Moreover, deleting headers is not simple; in order to ensure we always can arrive at consensus, we need an algorithm for determining when it is safe to do so. If a peer is serving us a chain that seems to have low-work, how do we know that it’s safe to delete, and that if we wait long enough we won’t eventually learn a high-work chain from this peer? This PR implements an algorithm for making that determination in a robust way that should guarantee that we can always reach consensus with the network.
(I do think that we could prune headers on startup/shutdown, though I imagine that after this PR there will be much less need to consider doing so.)
The larger question I have is: why are we putting in effort to remove the IBD checkpoint?
This PR does not touch the checkpoints; it solves a problem that the checkpoints do not solve (preventing DoS attacks against new nodes), and happens to do so in a way that would make checkpoints no longer necessary. However it’s more appropriate to discuss this in #25725, the PR where that is proposed.
As an aside for reviewers who may not be familiar with how checkpoints work: there is not a single “IBD checkpoint”, there are many heights for which we have cached blockhashes which we expect at those heights, and once we download those block headers matching those checkpointed hashes, we no longer accept new headers that would fork below those heights. Note that this is not a performance optimization during IBD, it is merely a protection against spam from low-difficulty headers.
fresheneesz
commented at 4:49 am on September 8, 2022:
none
this would turn a memory attack into a disk-filling attack
The block headers are currently 60 megabytes. In another 15 years they’ll be 120 megabytes. That doesn’t seem like enough to credibly attack any full node. Plus, this PR is about only downloading “enough PoW” which means you don’t need anywhere near that much disk space. Seems like the ideal way to do this is to download the latest X block headers, since those are the ones with the most PoW per bytes of block headers.
deleting headers is not simple; in order to ensure we always can arrive at consensus, we need an algorithm for determining when it is safe to do so.
This PR already deletes headers as soon as they’re received. If it simply stores them somewhere (and I do mean simply, ie not connecting it up with anything that would make it hard to safely delete) then they can be retrieved from that location instead of redownloading them.
But maybe I just don’t understand the attack that deleting the headers upon receipt is intended to prevent. Is that attack detailed anywhere? Why would the node need to download more than one set of headers at a time? A node should query its peers for the longest chain, and if there are competing longest chains, it should choose the longest/heaviest chain and verify via the headers that the chain is as heavy as it says, then start verifying the actual blocks.
Under what circumstance is there a possibility of a disk-filling attack here?
sdaftuar
commented at 2:00 pm on September 8, 2022:
member
The block headers are currently 60 megabytes. In another 15 years they’ll be 120 megabytes. That doesn’t seem like enough to credibly attack any full node.
Due to the timewarp bug[1], it is inexpensive to construct a consensus-valid blockchain where block timestamps advance by one second every 6 blocks. Starting at the genesis block timestamp (January 3, 2009), it’s possible to construct a chain that has over 2.5 billion blocks, where the last block’s timestamp is less than 2 hours in the future from the current time (September 2022). If an adversary produced such a chain and fed it to a new node, then storing it on disk while we calculate the accumulated work would mean we store almost 100 GB of data for a single peer (assuming we could, say, compress the headers from 80 bytes to 40 bytes). Of course, nodes can have many peers, up to 100 or more – so without complex logic to limit the number of concurrent headers syncs (which would also need to be carefully designed to thwart adversaries that seek to block with headers sync from an honest peer) – that could be 10TB of data being stored on disk.
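A back-of-envelope check of those figures; the timestamps and the 40-byte compressed-header size are approximations, and this is not code from the PR:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const int64_t genesis_time = 1231006505;             // 2009-01-03
        const int64_t now = 1662000000;                      // ~September 2022
        const int64_t max_blocks = 6 * (now - genesis_time); // timewarp rate bound
        const double gb = max_blocks * 40.0 / 1e9;           // ~40-byte compressed headers
        std::printf("~%lld blocks, ~%.0f GB per peer\n", (long long)max_blocks, gb);
        // ~2.59 billion blocks, ~103 GB per peer; ~100 peers -> ~10 TB.
    }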
Plus, this PR is about only downloading “enough PoW” which means you don’t need anywhere near that much disk space. Seems like the ideal way to do this is to download the latest X block headers, since those are the ones with the most PoW per bytes of block headers.
There is no functionality in the p2p protocol to “download the latest X block headers”.[2]
It would be great to design a protocol that would allow for proving the work on a chain in a compact/efficient way[3], but even if we had such a proposal ready to go, that wouldn’t solve our problem – in order to ensure we don’t break consensus with old nodes, we cannot deploy a new p2p protocol and then only sync headers with new software that implements the protocol (at least, not without a sufficiently long period of time to ensure adequate deployment of the new software). Thus we’d remain open to the DoS vector fixed in this PR until such time that we turned off headers sync from older software – and I think that is inferior to just fixing this issue for new software in a backwards compatible way, as this PR does.
But maybe I just don’t understand the attack that deleting the headers upon receipt is intended to prevent. Is that attack detailed anywhere?
I don’t know if my description above of the time warp bug is enough for you to understand the issue, but if you want some more background perhaps you can try the bitcoin stack exchange? There is also a code comment in #25970 that you might find helpful, starting here.
Why would the node need to download more than one set of headers at a time?
Different peers might be on different chain tips, and the simplest behavior we can have is to sync headers with all of our peers, without regard to one another. This can be bandwidth-wasteful when peers are on the same tip, so we have optimizations on startup to try to only sync with a single peer at first. But if that peer is adversarial or broken then they might not give us the honest chain, so at some point we have to try other peers as well, to ensure that as long as we have at least one honest peer, we’ll eventually sync the consensus chain (see #25720 for more discussion of this bandwidth tradeoff). While you could try to design a sync behavior that would throttle or rate-limit headers sync in some way, so that we don’t have multiple peers with which we’re syncing headers at the same time, that is a more complex logic than always being willing to sync headers with all peers that might have something to offer us (which only requires per-peer logic that is independent of what other peers might be giving us).
A node should query its peers for the longest chain, and if there are competing longest chains, it should choose the longest/heaviest chain and verify via the headers that the chain is as heavy as it says, then start verifying the actual blocks.
Yes, obviously. The details around how that is accomplished are what is being discussed, and the way we had been doing it prior to this PR was vulnerable to DoS, and hopefully now that this PR has been merged that should no longer be the case. If you have a concrete proposal for another way we could implement DoS protection while syncing headers in a way that will always keep us in consensus with the network, please suggest some code, but vague comments like this are not specific enough to discuss.
[2] The way headers are requested is via a “block locator”, which is a set of block hashes that describe the chain that a node is on. When our peer receives the locator, they are expected to find the first block hash in the locator that is on their chain, and then start sending us headers from that point on, towards their tip. Headers messages have a maximum of 2000 headers in them, so the node receiving a full headers message in response is expected to continue requesting headers from the last header in the message.
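For background, a sketch of how locator entries are typically chosen, mirroring the behavior described above; heights stand in for block hashes, and the exact step schedule is illustrative:

    #include <vector>

    std::vector<int> LocatorHeightsSketch(int tip_height)
    {
        std::vector<int> heights;
        int step = 1;
        for (int h = tip_height; h > 0; h -= step) {
            heights.push_back(h);
            // After the first ~10 recent blocks, back off exponentially so the
            // locator stays small even for very long chains.
            if (heights.size() >= 10) step *= 2;
        }
        heights.push_back(0); // always include the genesis block
        return heights;
    }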
[3] Many years ago, there was a proposal to revamp the way headers are requested to try to allow for something like this, but this effort stalled and was not implemented or deployed. This might be an interesting avenue for future work.
instagibbs
commented at 2:09 pm on September 8, 2022:
member
Plus, this PR is about only downloading “enough PoW” which means you don’t need anywhere near that much disk space. Seems like the ideal way to do this is to download the latest X block headers, since those are the ones with the most PoW per bytes of block headers.
@sdaftuar comments aside, this is simply not true. Mining difficulty can drop, and we don’t want to lose consensus due to this, with nodes online following lower PoW headers, and newly syncing nodes deciding it’s not high enough.
jonatack
commented at 7:20 am on September 16, 2022:
contributor
2022-09-15 <sipa> For 25717 observing the pre-syncing phase is one thing (it should be there), but arguably the more interesting property is that syncing still works at all. It’s only triggered when syncing a new node from scratch, or one that is ~months or more behind.