p2p: Make stalling timeout adaptive during IBD #25880

pull mzumsande wants to merge 2 commits into bitcoin:master from mzumsande:202208_stalling_timeout changing 3 files +193 −4
  1. mzumsande commented at 7:44 PM on August 19, 2022: contributor

    During IBD, there is the following stalling mechanism if we can't proceed with assigning blocks from a 1024-block lookahead window because all of these blocks are either already downloaded or in flight: We'll mark the peer from which we expect the current block that would allow us to advance our tip (and thereby move the 1024-block window ahead) as a possible staller. We then give this peer 2 more seconds to deliver a block (BLOCK_STALLING_TIMEOUT), and if it doesn't, disconnect it and assign the critical block we need to another peer.

    Now the problem is that this second peer is immediately marked as a potential staller using the same mechanism and given 2 seconds as well - if our own connection is so slow that it simply takes us more than 2 seconds to download this block, that peer will also be disconnected (and so on...), leading to repeated disconnections and no progress in IBD. This has been described in #9213, and I have observed this when doing IBD on slower connections or with Tor - sometimes there would be several minutes without progress, where all we did was disconnect peers and find new ones.

    The 2s stalling timeout was introduced in #4468, when blocks weren't full and before Segwit increased the maximum possible physical size of blocks - so I think it made a lot of sense back then. But it would be good to revisit this timeout now.

    This PR makes the timeout adaptive (idea by sipa): If we disconnect a peer for stalling, we now double the timeout for the next peer (up to a maximum of 64s). If we connect a block, we halve it again, down to the old value of 2 seconds. That way, peers that are comparatively slower will still get disconnected, but long phases of disconnecting all peers shouldn't happen anymore.

    Fixes #9213

  2. sipa commented at 7:51 PM on August 19, 2022: member

    Nice observation.

    (Brainstorm idea) How about something like doubling the timeout every time it causes a disconnection. And then reducing/resetting it when the window actually moves?

  3. DrahtBot added the label P2P on Aug 19, 2022
  4. luke-jr commented at 8:40 PM on August 20, 2022: member

    Collect statistics from recent block download times during IBD and have a dynamic timeout based on this. (Introduces more complexity, but might be better in certain situations, e.g. when 6s aren't sufficient either).

    Rather than this, it might be better to track download speeds from each peer, and check the speeds of this peer after 2 seconds.

    For an immediate fix, though, maybe just making the timeout configurable would be a good idea?

    Perhaps as an interim between these two ideas, if we disconnect N stalling peers, start increasing the timeout.

  5. mzumsande commented at 2:21 PM on August 22, 2022: contributor

    Rather than this, it might be better to track download speeds from each peer, and check the speeds of this peer after 2 seconds.

    We would also have to compare it to the speed of others and have some criterion for what deviation would be enough to disconnect.

    How about something like doubling the timeout every time it causes a disconnection. And then reducing/resetting it when the window actually moves?

    Perhaps as an interim between these two ideas, if we disconnect N stalling peers, start increasing the timeout.

    Thanks! These suggestions are similar, make a lot of sense to me, and don't look very invasive to implement. I'm planning to change to this approach and update soon.

  6. mzumsande marked this as a draft on Aug 22, 2022
  7. mzumsande force-pushed on Aug 24, 2022
  8. mzumsande commented at 3:06 PM on August 24, 2022: contributor

    I now implemented the suggestion by @sipa to double the timeout and updated the OP.

    I tested this manually by catching up to the best chain with an ~1 month old datadir with -onlynet=tor (slow, blocks take ~10s to download), while reducing BLOCK_DOWNLOAD_WINDOW and MAX_BLOCKS_IN_TRANSIT_PER_PEER to make stalling situations happen more quickly.

  9. mzumsande renamed this:
    p2p: Increase `BLOCK_STALLING_TIMEOUT` timeout during IBD
    p2p: Make `BLOCK_STALLING_TIMEOUT` timeout adaptive during IBD
    on Aug 24, 2022
  10. mzumsande marked this as ready for review on Aug 24, 2022
  11. mzumsande renamed this:
    p2p: Make `BLOCK_STALLING_TIMEOUT` timeout adaptive during IBD
    p2p: Make stalling timeout adaptive during IBD
    on Aug 24, 2022
  12. ajtowns commented at 2:31 PM on August 26, 2022: contributor

    Nice! This seems a fine improvement.

    I think one way of looking at stalling is that it happens when one peer's bandwidth is less than 1/64th of the total bandwidth (64 = 1024/16 = window/max in transit) [0].

    I think that means there might be a clever way of preventing slow nodes from stalling the download by reducing the in transit limit instead -- so that instead of supplying 16 blocks in the time it takes the other peers to supply 1008 to avoid stalling, the peer only needs to supply 8 or 4 blocks in the time it takes the other peers to supply 1016 or 1020. [1]

    I think adding the block-relay-only peers probably made this slightly worse, since there are now 2 extra peers, so now you need something like 25% more bandwidth in order to still have 1/64th of the total...

    [0] Measured in blocks of course, so even if your bandwidth in bytes is fine, you might be unlucky and be asked for 16 blocks that are 2MB each, while everyone else is just being asked for 1008 50kB blocks (32MB for you vs 7.2MB per peer).

    [1] Perhaps you could implement this by keeping a global and per-peer exponential rolling average of how many blocks you download per second; then you could set the peer's in-transit limit to 1024 * peer_avg / global_avg; capping it at 16, and marking the peer as stalling and disconnecting if that value drops below 2 (in which case the remaining peers each have 50x this peer's bandwidth)?

  13. in src/net_processing.cpp:1732 in 686936c92a outdated
    1728 | @@ -1723,6 +1729,9 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1729 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1730 |          }
    1731 |      }
    1732 | +    if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
    


    luke-jr commented at 8:27 PM on August 26, 2022:

    I would think only reduce the timeout if the block was in fact downloaded in less time (than the would-be new timeout).


    mzumsande commented at 9:54 PM on August 30, 2022:

    I like about the current solution that it is very simple, prevents the node from getting stuck, and doesn't require additional bookkeeping of historical block download times.

    I think your suggestion wouldn't be completely straightforward to implement: The block being connected here might have been downloaded some time in the past, saved to disk, but only connected now (as a result of its predecessor being connected). So we'd need some data structure to keep track of download times for not-yet-connected blocks and add/remove entries from it during IBD.

    If we did this, it would help us cycle through fewer peers in situations where we assign multiple blocks to a peer and halving the timeout after successfully downloading a block would lead to a stalling situation again - but note that there are also other sources of disconnections that could be improved if we kept track of this kind of data during IBD: E.g. if doubling the timeout is not sufficient and we'd need to 4x or 8x it.

    So if we want something better but more complicated (with bookkeeping), my feeling is that we should go for another approach altogether, like basing the stalling timeout on a running average over the last received block times from multiple peers instead of a doubling/halving approach.


    luke-jr commented at 10:08 PM on August 30, 2022:

    With this current code, it will stall, get a block, stall, get a block, stall, etc repeatedly...


    Ellajoke commented at 10:12 PM on August 30, 2022:

    Okay


    mzumsande commented at 10:38 PM on August 30, 2022:

    With this current code, it will stall, get a block, stall, get a block, stall, etc repeatedly...

    Yes, but only a few times until the blocks preventing the tip from moving are downloaded, then the tip advances by connecting the large number of stashed blocks from the 1024 window, ending the stalling situation. If every peer is equally slow, it doesn't matter if you download a block in 2s or 1 minute from the viewpoint of the stalling logic.


    mzumsande commented at 10:54 PM on August 30, 2022:

    Maybe instead of halving the timeout, we should, on block connection, multiply it with a factor 0.5 < f < 1 to let it go back slower?


    vasild commented at 8:23 AM on August 31, 2022:

    I agree it seems better to decrease it slowly. If it was a single slow peer, then there would be many blocks coming on time afterwards.


    mzumsande commented at 5:39 PM on September 14, 2022:

    With the latest push, I decrease it by a factor 0.85 with each connected block.

  14. mzumsande commented at 10:30 PM on August 30, 2022: contributor

    I think that means there might be a clever way of preventing slow nodes from stalling the download by reducing the in transit limit instead -- so that instead of supplying 16 blocks in the time it takes the other peers to supply 1008 to avoid stalling, the peer only needs to supply 8 or 4 blocks in the time it takes the other peers to supply 1016 or 1020. [1]

    That sounds like a very interesting alternative approach. I'm not sure I understand it completely though: Are you suggesting to assign fewer simultaneous blocks to slower peers, to help prevent stalling situations from occurring in the first place? And also move away from the concept that a stalling situation occurs only when we can't move the 1024 block window forward, but make it dependent on the other peers instead, so that we'd possibly disconnect slow peers much earlier than that if they are slower in comparison to faster ones?

  15. mzumsande force-pushed on Aug 30, 2022
  16. w0xlt approved
  17. in src/net_processing.cpp:5246 in 7c8c4e4713 outdated
    5242 |              LogPrintf("Peer=%d is stalling block download, disconnecting\n", pto->GetId());
    5243 |              pto->fDisconnect = true;
    5244 | +            // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
    5245 | +            // bandwidth is insufficient.
    5246 | +            m_block_stalling_timeout = std::min(2 * m_block_stalling_timeout.load(), MAX_BLOCK_STALLING_TIMEOUT);
    5247 | +            LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
    


    vasild commented at 7:34 AM on August 31, 2022:

    Once the max is reached, this would print repeated messages like

    Increased timeout to 64 seconds
    Increased timeout to 64 seconds
    Increased timeout to 64 seconds
    ...
    

    which would be misleading because the timeout was not increased.


    vasild commented at 8:09 AM on August 31, 2022:

    This sequence is not atomic. If two threads execute the increase/decrease code concurrently, it could lead to unexpected results. Consider this: it only does the inc/dec if no other thread changed the value in the meantime, and otherwise leaves it untouched:

    <details> <summary>atomic</summary>

    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 42686f0db0..fd2f22cdcd 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -1726,14 +1726,16 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
             LOCK(cs_main);
             for (const auto& ptx : pblock->vtx) {
                 m_txrequest.ForgetTxHash(ptx->GetHash());
                 m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
             }
         }
    -    if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
    -        m_block_stalling_timeout = std::max(m_block_stalling_timeout.load() / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
    +    auto stalling_timeout = m_block_stalling_timeout.load();
    +    const auto new_timeout = std::max(stalling_timeout / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
    +    if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
    +        LogPrint(BCLog::NET, "Decreased stalling timeout to %d seconds\n", new_timeout.count());
         }
     }
     
     void PeerManagerImpl::BlockDisconnected(const std::shared_ptr<const CBlock> &block, const CBlockIndex* pindex)
     {
         // To avoid relay problems with transactions that were previously
    @@ -5231,22 +5233,25 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                     }
             }
             if (!vInv.empty())
                 m_connman.PushMessage(pto, msgMaker.Make(NetMsgType::INV, vInv));
     
             // Detect whether we're stalling
    -        if (state.m_stalling_since.count() && state.m_stalling_since < current_time - m_block_stalling_timeout.load()) {
    +        auto stalling_timeout = m_block_stalling_timeout.load();
    +        if (state.m_stalling_since.count() && state.m_stalling_since < current_time - stalling_timeout) {
                 // Stalling only triggers when the block download window cannot move. During normal steady state,
                 // the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
                 // should only happen during initial block download.
                 LogPrintf("Peer=%d is stalling block download, disconnecting\n", pto->GetId());
                 pto->fDisconnect = true;
                 // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
                 // bandwidth is insufficient.
    -            m_block_stalling_timeout = std::min(2 * m_block_stalling_timeout.load(), MAX_BLOCK_STALLING_TIMEOUT);
    -            LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
    +            const auto new_timeout = std::min(2 * stalling_timeout, MAX_BLOCK_STALLING_TIMEOUT);
    +            if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
    +                LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", new_timeout.count());
    +            }
                 return true;
             }
             // In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
             // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
             // We compensate for other peers to prevent killing off peers due to our own downstream link
             // being saturated. We only count validated in-flight blocks so peers can't advertise non-existing block hashes
    

    </details>


    mzumsande commented at 5:40 PM on September 14, 2022:

    Thanks, I took your suggestion and added you as coauthor - didn't know about compare_exchange_strong before.


    mzumsande commented at 5:59 PM on September 14, 2022:

    Fixed according to your suggestion.

  18. in src/net_processing.cpp:5237 in 7c8c4e4713 outdated
    5233 | @@ -5225,12 +5234,16 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
    5234 |              m_connman.PushMessage(pto, msgMaker.Make(NetMsgType::INV, vInv));
    5235 |  
    5236 |          // Detect whether we're stalling
    5237 | -        if (state.m_stalling_since.count() && state.m_stalling_since < current_time - BLOCK_STALLING_TIMEOUT) {
    5238 | +        if (state.m_stalling_since.count() && state.m_stalling_since < current_time - m_block_stalling_timeout.load()) {
    


    vasild commented at 8:13 AM on August 31, 2022:

    The increase would happen regardless of IBD, but the decrease only during IBD. In theory, this means that once out of IBD, the timeout could only increase or stay unchanged, but never decrease. Is this intended?

    Maybe better have the IBD condition on both inc/dec or remove it from both.


    mzumsande commented at 5:41 PM on September 14, 2022:

    Removed the IBD condition

  19. in src/net_processing.cpp:1733 in 7c8c4e4713 outdated
    1728 | @@ -1723,6 +1729,9 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1729 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1730 |          }
    1731 |      }
    1732 | +    if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
    1733 | +        m_block_stalling_timeout = std::max(m_block_stalling_timeout.load() / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
    


    vasild commented at 8:15 AM on August 31, 2022:

    The decrease does not log anything, but increase does. For symmetry and more log clarity, is it better to log both operations?


    mzumsande commented at 5:41 PM on September 14, 2022:

    Done.

  20. vasild commented at 8:18 AM on August 31, 2022: contributor

    Concept ACK

    Some adaptivity seems to be warranted, because network throughput is rarely a constant.

  21. naumenkogs commented at 10:10 AM on August 31, 2022: member

    Concept ACK. I will give a code review ACK once you resolve vasil's comments :)

  22. ajtowns commented at 3:51 PM on August 31, 2022: contributor

    That sounds like a very interesting alternative approach.

    Yeah, it shouldn't hold up this fix though, I think.

    I'm not sure I understand it completely though: Are you suggesting to assign slower peers less blocks simultaneously, to help prevent stalling situations from occurring in the first place?

    Yes.

    And also move away from the concept that a stalling situation occurs only when we can't move the 1024 block window forward, but make it dependent on the other peers instead, so that we'd possibly disconnect slow peers much earlier than that if they are slower in comparison to faster ones?

    Not really. I think you need to keep the 1024 block window (since increasing that hurts pruning), and I think that if the window gets full you should still call that "stalling".

    But I think if you change the MAX_BLOCKS_IN_TRANSIT_PER_PEER so that slower peers have fewer in-transit blocks, then you'll be stalling much less often, and may not need to disconnect them at all -- that lets you stay connected to tor peers during IBD (for partition/sybil resistance) even if your ipv4/ipv6 peers are much faster. Maybe you could disconnect them when their max_blocks_in_transit drops to 1 or 2, before they actually cause stalling?

    Maybe a more specific example would be worthwhile. As it stands, if your first peer will give you one block every 5 seconds, and your other 9 peers will collectively give you 14 blocks every second (on average, 7.8 times faster than the first peer, in total 70 times faster), then by the time that first peer has downloaded blocks 1..15 (which takes 75 seconds), the other peers will have given you blocks 17..1039 after 73.1 seconds, and stalling gets triggered. But if the slow peer had only queued up 8 blocks, then it would have supplied them in 40 seconds, which only gives the other peers enough time to supply 560 blocks, so they won't fill up the window. Hmm, I guess it ought to be possible to simulate that scenario via the functional test's P2PInterface stuff...

  23. dergoegge commented at 9:39 AM on September 9, 2022: member

    Concept ACK

    I think it would be nice if this PR also added some tests, because it looks like we didn't have any tests for the stalling mechanism in the first place.

  24. mzumsande commented at 5:50 PM on September 12, 2022: contributor

    Will address feedback soon (and work on adding a test for the stalling logic).

  25. mzumsande force-pushed on Sep 14, 2022
  26. mzumsande commented at 5:58 PM on September 14, 2022: contributor

    a764c20 to 48e5385:

    • Addressed review comments: In particular, don't halve the adaptive timeout each time a block is connected, but let it go back to the default value more slowly
    • Added a functional test for the stalling logic.
  27. mzumsande force-pushed on Sep 14, 2022
  28. DrahtBot commented at 12:07 AM on September 15, 2022: contributor


    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.


    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK vasild, naumenkogs, achow101
    Concept ACK dergoegge, sipa, RandyMcMillan
    Stale ACK w0xlt, luke-jr

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.


    Conflicts

    No conflicts as of last run.

  29. in test/functional/p2p_ibd_stalling.py:85 in 48e538510b outdated
      80 | +            for id in range(8):
      81 | +                peers.append(node.add_outbound_p2p_connection(P2PStaller(stall_block), p2p_idx=id, connection_type="outbound-full-relay"))
      82 | +                peers[-1].block_store = block_dict
      83 | +                peers[-1].send_message(headers_message)
      84 | +            self.wait_until(lambda: self.total_blocks_sent(peers) == NUM_BLOCKS - 2)
      85 | +        time.sleep(0.5)  # Wait until all blocks have arrived at the node
    


    dergoegge commented at 1:39 PM on September 15, 2022:

    Not a fan of timeouts like this, but in this case I also don't see how to avoid it.


    mzumsande commented at 7:56 PM on September 20, 2022:

    Yes, I was looking for an RPC we could wait_until for to avoid this - but I didn't find a way of querying the number of blocks a node has downloaded (including those not yet connected to the chain).


    glozow commented at 9:42 AM on September 29, 2022:

    getpeerinfo()[0]["bytesrecv_per_msg"]["block"] ?


    vasild commented at 4:51 PM on October 19, 2022:

    To follow @glozow's idea, this should work:

    diff --git i/test/functional/p2p_ibd_stalling.py w/test/functional/p2p_ibd_stalling.py
    index 78626c003b..d593187d74 100755
    --- i/test/functional/p2p_ibd_stalling.py
    +++ w/test/functional/p2p_ibd_stalling.py
    @@ -79,13 +79,14 @@ class P2PIBDStallingTest(BitcoinTestFramework):
             with self.nodes[0].assert_debug_log([], unexpected_msgs=['Stall started']):
                 for id in range(8):
                     peers.append(node.add_outbound_p2p_connection(P2PStaller(stall_block), p2p_idx=id, connection_type="outbound-full-relay"))
                     peers[-1].block_store = block_dict
                     peers[-1].send_message(headers_message)
                 self.wait_until(lambda: self.total_blocks_sent(peers) == NUM_BLOCKS - 2)
    -        time.sleep(0.5)  # Wait until all blocks have arrived at the node
    +            self.wait_until(lambda: self.total_bytes_recv_for_blocks() == 172761)
    +
     
             self.log.info("Check that increasing the window beyond 1024 blocks triggers stalling logic")
             headers_message.headers = [CBlockHeader(b) for b in blocks]
             with self.nodes[0].assert_debug_log(expected_msgs=['Stall started peer=0']):
                 for p in peers:
                     p.send_message(headers_message)
    @@ -149,9 +150,15 @@ class P2PIBDStallingTest(BitcoinTestFramework):
         def total_blocks_sent(self, peers):
             num_blocks = 0
             for p in peers:
                 num_blocks += p.blocks_sent
             return num_blocks
     
    +    def total_bytes_recv_for_blocks(self):
    +        total = 0
    +        for info in self.nodes[0].getpeerinfo():
    +            total += info["bytesrecv_per_msg"]["block"]
    +        return total
    +
     
     if __name__ == '__main__':
         P2PIBDStallingTest().main()
    

    and even better if we can put some formula behind the magic number.


    vasild commented at 9:16 AM on October 20, 2022:

    or, more pythonish:

    self.wait_until(lambda: sum(e["bytesrecv_per_msg"]["block"] for e in self.nodes[0].getpeerinfo()) == 172761)
    

    mzumsande commented at 8:53 PM on October 21, 2022:

    Thanks, that works! We send 1023 blocks, the formula would be 126 * 168 + 897 * 169 = 172761 (at some point the blocks get larger by 1 byte). Not super stable because bytesrecv_per_msg includes the extra 24 bytes magic etc. (not just the payload) - so I think the magic number would e.g. be different for BIP324 - but it's definitely better than a fixed timeout.


    mzumsande commented at 10:29 PM on October 21, 2022:

    I ended up using the less pythonish version because we also need a check that the "block" field exists in "bytesrecv_per_msg", and I find that easier to read. Added a comment for the magic number.

  30. in test/functional/p2p_ibd_stalling.py:109 in 48e538510b outdated
      98 | +        self.wait_until(lambda: self.is_block_requested(peers, stall_block))
      99 | +        # Make sure that SendMessages() is invoked, which assigns the missing block
     100 | +        # to another peer and start the stalling logic for them
     101 | +        self.all_sync_send_with_ping(peers)
     102 | +
     103 | +        self.log.info("Check that the stalling timeout gets doubled to 4 seconds for the next staller")
    


    dergoegge commented at 1:45 PM on September 15, 2022:

    This test actually almost passes for the previous stalling logic as well; it only fails on "Check that stalling timeout gets reduced back to 2 seconds after connecting several blocks" (because of the assertion on the debug log).

    Maybe add a condition to check that after 3 seconds the stalling peer is not disconnected? (same for the next doubling to 8 seconds)


    mzumsande commented at 9:15 PM on September 20, 2022:

    Done with the latest push.

  31. dergoegge commented at 1:52 PM on September 15, 2022: member

    The net processing changes look good to me, left some comments on the test.

  32. in src/net_processing.cpp:1735 in 48e538510b outdated
    1729 | @@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1730 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1731 |          }
    1732 |      }
    1733 | +    auto stalling_timeout = m_block_stalling_timeout.load();
    1734 | +    // In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
    1735 | +    const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds> (stalling_timeout * 0.85), DEFAULT_BLOCK_STALLING_TIMEOUT);
    


    sipa commented at 6:01 PM on September 15, 2022:

    Is it intentional that this happens with an accuracy of 1 second?

    So for example you could have the sequence 64, 54, 45, 38, 32, 27, 22, 18, 15, 12, 10, 8, 6, 5, 4, 3, 2... seconds.

    The stalling_timeout != new_timeout condition can also be placed before the exchange, I think?


    luke-jr commented at 7:36 PM on September 20, 2022:

    Probably should be IMO. Not sure how efficient std::atomic is with std::chrono units.


    mzumsande commented at 9:15 PM on September 20, 2022:

    Is it intentional that this happens with an accuracy of 1 second?

    It was kind of intentional, I first thought of changing m_block_stalling_timeout to microseconds, requiring conversions for the logging etc. but then I wondered whether a higher accuracy adds anything - I also thought of simply decreasing it constantly by one second per block received (instead of using a factor). Do you or others have a preference here?

    The stalling_timeout != new_timeout condition can also be placed before the exchange, I think?

    Done - I changed the order within the if statements (in two places).

  33. sipa commented at 6:04 PM on September 15, 2022: member

    Concept ACK

  34. amovfx commented at 11:44 PM on September 18, 2022: none

    Reviewed for the pr club.

  35. in src/net_processing.cpp:1736 in 41a2b941cd outdated
    1729 | @@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1730 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1731 |          }
    1732 |      }
    1733 | +    auto stalling_timeout = m_block_stalling_timeout.load();
    1734 | +    // In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
    1735 | +    const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds> (stalling_timeout * 0.85), DEFAULT_BLOCK_STALLING_TIMEOUT);
    1736 | +    if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
    


    naumenkogs commented at 6:44 AM on September 20, 2022:

    Let's say compare_exchange_strong triggers the false path, meaning m_block_stalling_timeout was updated elsewhere.

    If I understand this method correctly, it will update stalling_timeout to the new value of m_block_stalling_timeout. Then, assuming stalling_timeout != new_timeout passes, the logging will be wrong?

    1. We haven't decreased the actual value here
    2. The latest value is for sure something different than new_timeout

    I suspect this event could be even impossible in our execution, but then I'm wondering if there's a way to express this observation in code, rather than having a meaningless code branch.


    mzumsande commented at 6:56 PM on September 20, 2022:

    According to https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange, compare_exchange_strong returns true if the underlying atomic value was successfully changed, false otherwise. So my understanding is that if nothing was updated because m_block_stalling_timeout was changed by some other thread, the logging statement won't be executed.


    naumenkogs commented at 9:09 AM on September 21, 2022:

    According to https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange, compare_exchange_strong returns true if the underlying atomic value was successfully changed, false otherwise.

    I think this is only a part of what's happening, and if you consider the latter (also from your link), my example would make more sense?

    Atomically compares the object representation (until C++20)value representation (since C++20) of *this with that of expected, and if those are bitwise-equal, replaces the former with desired (performs read-modify-write operation). Otherwise, loads the actual value stored in *this into expected (performs load operation).


    mzumsande commented at 3:24 PM on September 21, 2022:

    I thought you meant stalling_timeout possibly being updated, which might have had an influence on the second condition (stalling_timeout != new_timeout). My point was that this couldn't possibly matter here because if stalling_timeout is updated, that must go together with compare_exchange_strong() returning false, and the two conditions are combined with an &&-operator, so the entire clause is false (and the second condition will not even be evaluated due to short-circuiting).

    Note that I changed the order of the two conditions with the last push, so stalling_timeout != new_timeout is now the first of the two.

    Note that I changed the order of the two condition with the last push, so stalling_timeout != new_timeout is now the first of the two.


    naumenkogs commented at 8:30 AM on September 23, 2022:

    You are right, I was wrong about the previous version. The updated version is better anyway :) feel free to resolve this.

  36. in src/net_processing.cpp:1734 in 48e538510b outdated
    1729 | @@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1730 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1731 |          }
    1732 |      }
    1733 | +    auto stalling_timeout = m_block_stalling_timeout.load();
    1734 | +    // In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
    


    luke-jr commented at 7:34 PM on September 20, 2022:

    In the normal case, the timeout is already at the default, so probably good to optimise for that:

    if (stalling_timeout > DEFAULT_BLOCK_STALLING_TIMEOUT) {
    

    mzumsande commented at 9:16 PM on September 20, 2022:

    Done, makes sense to me.

  37. in src/net_processing.cpp:117 in 48e538510b outdated
     114 | -static constexpr auto BLOCK_STALLING_TIMEOUT{2s};
     115 | +/** Default time during which a peer must stall block download progress before being disconnected.
     116 | +  * the actual timeout is increased temporarily if peers are disconnected for hitting the timeout */
     117 | +static constexpr auto DEFAULT_BLOCK_STALLING_TIMEOUT{2s};
     118 | +/** Maximum timeout for stalling block download. */
     119 | +static constexpr auto MAX_BLOCK_STALLING_TIMEOUT{64s};
    


    luke-jr commented at 7:35 PM on September 20, 2022:

    nit: Perhaps BLOCK_STALLING_TIMEOUT_{DEFAULT,MAX} would be better?


    mzumsande commented at 9:17 PM on September 20, 2022:

    Done, maybe that makes it easier to see that they are used for the same thing.
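
    As a standalone illustration of how the two renamed constants interact (assuming the doubling-with-clamp behavior described in the PR description, not the PR's actual code): each stalling disconnection doubles the timeout until the maximum is hit, so the cap is reached after five consecutive disconnections.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <chrono>

    int main()
    {
        using namespace std::chrono_literals;
        constexpr auto BLOCK_STALLING_TIMEOUT_DEFAULT{2s};
        constexpr auto BLOCK_STALLING_TIMEOUT_MAX{64s};

        // Each stalling disconnection doubles the timeout, clamped at the max.
        auto timeout = BLOCK_STALLING_TIMEOUT_DEFAULT;
        int disconnects = 0;
        while (timeout < BLOCK_STALLING_TIMEOUT_MAX) {
            timeout = std::min(2 * timeout, BLOCK_STALLING_TIMEOUT_MAX);
            ++disconnects;
        }
        assert(timeout == 64s);
        assert(disconnects == 5); // 2 -> 4 -> 8 -> 16 -> 32 -> 64
        return 0;
    }
    ```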

  38. luke-jr changes_requested
  39. mzumsande force-pushed on Sep 20, 2022
  40. mzumsande commented at 9:19 PM on September 20, 2022: contributor

    48e5385 to 4b0dbc0: Addressed review feedback

  41. luke-jr approved
  42. luke-jr commented at 1:17 AM on September 21, 2022: member

    utACK 4b0dbc0f3eb8c57944f9037e017b89c912048206

  43. amovfx commented at 6:32 PM on September 21, 2022: none

    Reviewed changes for pr review club.

  44. in src/net_processing.cpp:1736 in 4b0dbc0f3e outdated
    1729 | @@ -1723,6 +1730,15 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1730 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1731 |          }
    1732 |      }
    1733 | +    auto stalling_timeout = m_block_stalling_timeout.load();
    1734 | +
    1735 | +    // In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
    1736 | +    if (stalling_timeout > BLOCK_STALLING_TIMEOUT_DEFAULT) {
    


    naumenkogs commented at 8:37 AM on September 23, 2022:

    I would probably replace this with the following code (haven't compiled):

    Assume(stalling_timeout >= BLOCK_STALLING_TIMEOUT_DEFAULT);
    if (stalling_timeout != BLOCK_STALLING_TIMEOUT_DEFAULT)
    ....
    

    No big deal though.


    mzumsande commented at 8:17 PM on October 24, 2022:

    done, I agree that it should be impossible that stalling_timeout is below the default.

  45. naumenkogs commented at 8:48 AM on September 23, 2022: member

    utACK 4b0dbc0f3eb8c57944f9037e017b89c912048206

  46. glozow requested review from dergoegge on Oct 12, 2022
  47. glozow requested review from sipa on Oct 12, 2022
  48. in test/functional/p2p_ibd_stalling.py:42 in 4b0dbc0f3e outdated
      37 | +
      38 | +    def on_getdata(self, message):
      39 | +        for inv in message.inv:
      40 | +            self.getdata_requests.append(inv.hash)
      41 | +            if (inv.type & MSG_TYPE_MASK) == MSG_BLOCK:
      42 | +                if(not self.is_stalling or inv.hash != self.stall_block):
    


    vasild commented at 3:52 PM on October 19, 2022:

    nit: missing space after if

  49. in test/functional/p2p_ibd_stalling.py:108 in 4b0dbc0f3e outdated
     103 | +        self.log.info("Check that the stalling timeout gets doubled to 4 seconds for the next staller")
     104 | +        # No disconnect after just 3 seconds
     105 | +        self.mocktime += 3
     106 | +        node.setmocktime(self.mocktime)
     107 | +        self.all_sync_send_with_ping(peers)
     108 | +        assert(len(self.nodes[0].getpeerinfo()) == 7)
    


    vasild commented at 4:24 PM on October 19, 2022:

    for consistency with earlier and because assert_equal() gives better error message:

            assert_equal(len(self.nodes[0].getpeerinfo()), 7)
    
  50. in test/functional/p2p_ibd_stalling.py:121 in 4b0dbc0f3e outdated
     116 | +        self.log.info("Check that the stalling timeout gets doubled to 8 seconds for the next staller")
     117 | +        # No disconnect after just 7 seconds
     118 | +        self.mocktime += 7
     119 | +        node.setmocktime(self.mocktime)
     120 | +        self.all_sync_send_with_ping(peers)
     121 | +        assert(len(self.nodes[0].getpeerinfo()) == 6)
    


    vasild commented at 4:25 PM on October 19, 2022:

    here too

            assert_equal(len(self.nodes[0].getpeerinfo()), 6)
    
  51. vasild commented at 4:49 PM on October 19, 2022: contributor

    ACK 027abcfd846ad4939e823b492de8703504472bd7

    This ACK is for the last-but-one commit, i.e. without the test (which is the last commit).

    The test takes 1m45s on my laptop and deterministically fails with the following output:

    <details> <summary>test failure</summary>

    2022-10-19T16:32:15.542000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_6vaknmm8
    2022-10-19T16:32:16.089000Z TestFramework (INFO): Prepare blocks without sending to the node
    2022-10-19T16:32:16.173000Z TestFramework (INFO): Check that stalling does not start the stalling logic yet if there is a 1024 block window.
    
    === Note: 1m40s have passed ===
    
    2022-10-19T16:33:58.948000Z TestFramework (INFO): Check that increasing the window beyond 1024 blocks triggers stalling logic
    2022-10-19T16:34:00.977000Z TestFramework (ERROR): Assertion failed
    Traceback (most recent call last):
      File ".../test/functional/test_framework/test_framework.py", line 133, in main
        self.run_test()
      File "./test/functional/p2p_ibd_stalling.py", line 91, in run_test
        p.send_message(headers_message)
      File "/usr/local/lib/python3.8/contextlib.py", line 120, in __exit__
        next(self.gen)
      File ".../test/functional/test_framework/test_node.py", line 423, in assert_debug_log
        self._raise_assertion_error('Expected messages "{}" does not partially match log:\n\n{}\n\n'.format(str(expected_msgs), print_log))
      File ".../test/functional/test_framework/test_node.py", line 167, in _raise_assertion_error
        raise AssertionError(self._node_msg(msg))
    AssertionError: [node 0] Expected messages "['Stall started peer=0']" does not partially match log:
    
     - 2022-10-19T16:33:58.962946Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:33:58.963134Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0b0f67ff3588e364b00c1c3d81bf9c97f03a1be5e419e3a7ad30771264a2531e peer=4
     - 2022-10-19T16:33:58.990215Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:33:58.990423Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 28cdc0d0e9229e11a2acb27aedc48b1b8b0797f0aae951e2af19484e1657444a peer=3
     - 2022-10-19T16:33:59.017936Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:33:59.018141Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 1d774e4493f1a5479a887f2486e0b9c08ec551047b8bc3d0ec34a64ea4961d37 peer=7
     - 2022-10-19T16:33:59.045861Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:33:59.046067Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 1cb6855b049f1ad7ce3d6f3d387e667c5efe4fbe5f5ae58fad60ab270e29028c peer=5
     - 2022-10-19T16:33:59.073667Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:33:59.073861Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 2659080a77588a7961790375bd7e87665ce8bef110500b83ede37f8fcb78239d peer=6
     - 2022-10-19T16:33:59.098941Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:33:59.099112Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 789b3151fd14751ca737bc2f9404619348e8292c3a88dc7e223a1407eb160e50 peer=1
     - 2022-10-19T16:33:59.123602Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:33:59.123776Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 300a1c8ce3ec1487d024f9208b33329f4f516822c96f0668793b7d2b05dc29a9 peer=7
     - 2022-10-19T16:33:59.148051Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:33:59.148327Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 37cc7b66fb6822b94008f45e99c14a643e7c672d485dad3806487c7d1954d227 peer=5
     - 2022-10-19T16:33:59.172596Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:33:59.172769Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 04114e74efbe1632f13632b61637902a018b5b4783c359ceac6c4ba4a009cea1 peer=4
     - 2022-10-19T16:33:59.197258Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:33:59.197424Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 56a92992336e241ab1ca9847cb8ac49c32737642aff0daf6eeec9a0f36dbd6a9 peer=0
     - 2022-10-19T16:33:59.221526Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:33:59.221695Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 223935072469d097518309b24f4fdb55292879da781632f6342627036a587479 peer=2
     - 2022-10-19T16:33:59.245923Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:33:59.246110Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 2bbfff35767e9b833c0becfc7bfd471ed26bc301d8bf0d62f79489531c65dc96 peer=6
     - 2022-10-19T16:33:59.271447Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:33:59.271638Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 4d04d79bb11c0dcbb80a4dd75ebc1edf99e9e418ec320f1b00627a8e6048074d peer=3
     - 2022-10-19T16:33:59.296551Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:33:59.296724Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 561e40bdc4ae9ea5157017447f11d1c3a2ff0bf0d8946e9c23977f9bd237bac6 peer=7
     - 2022-10-19T16:33:59.320649Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:33:59.320815Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 2240ca37da79ebb62af65b3ccf64ede02646d9944903acb35d7f552172df3a5c peer=5
     - 2022-10-19T16:33:59.345287Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:33:59.345471Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 53fec40b1cc0906773a036205e6c43cf84777e5e6c9b3d1964af88292f934abe peer=1
     - 2022-10-19T16:33:59.370857Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:33:59.371040Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 2feb53f9a978d4b26fae6928fc233cb4bc1aa41fc3eaba37b8efe9456a4f356d peer=0
     - 2022-10-19T16:33:59.396133Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:33:59.396308Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 1ed11920cf800de1cd19fb7ab637fc4932c17e5dce3ffe37eb7cb21c88cfc0b2 peer=3
     - 2022-10-19T16:33:59.421024Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:33:59.421229Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 4ae79c9217806fd6120589ef5025b0b07b35b13f3f5fd94860e24668d2dcf9b6 peer=6
     - 2022-10-19T16:33:59.445160Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:33:59.445326Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 149c2e1b1fb8a27086466065f6272dc4628872b40c5981150fd839c702d0f27d peer=2
     - 2022-10-19T16:33:59.470239Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:33:59.470442Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 3b723e6be5ec0e7162deefa48363c601582ef674069f00208b22f15635657425 peer=4
     - 2022-10-19T16:33:59.494524Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:33:59.494718Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 61c11ae01b6fc69d1a12234463c5ca7b99a4886d42d2db2384f0426e64995e9c peer=0
     - 2022-10-19T16:33:59.520750Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:33:59.520936Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 51ebc573f486a5d3dc8481608af3b6807fc9e24cf9c7c1a8391bbe9d80797a74 peer=3
     - 2022-10-19T16:33:59.545276Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:33:59.545444Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 694424de8a6e1ad8506843500212048610c0b01311df7c9e13a7a91159ae7a88 peer=6
     - 2022-10-19T16:33:59.569763Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:33:59.569935Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 4d5f8c921bafba7a38bf8add0de97c47b3867a6ed22c46ef8df9ee91c371067b peer=1
     - 2022-10-19T16:33:59.594875Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:33:59.595051Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 27252479f5c989fa08bf50a9f863c8d36240ba03eee4be2dcdc1b668474982f9 peer=4
     - 2022-10-19T16:33:59.620172Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:33:59.620401Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 28fd48a46bbde9fe24fe631861b43a7ae0151b1b967bc36304e149177c6a9b59 peer=5
     - 2022-10-19T16:33:59.644291Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:33:59.644456Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 75aa0892be28704273dcdd5838822135bfa6616693791ad48bb5675c5003c866 peer=7
     - 2022-10-19T16:33:59.670471Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:33:59.670657Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 276e366ca671621fbc08621ee7728f80a4e9527837112210ea4db7750d72bd85 peer=2
     - 2022-10-19T16:33:59.695404Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:33:59.695579Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 76d7bdc226b02d83dcf8510b820509288dbb6d981be8481c5ca98ba15fc0c944 peer=5
     - 2022-10-19T16:33:59.719515Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:33:59.719703Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 04f543e95f6ae0ca18c5812969de94972497ca511f0d12286baae29ad4605d06 peer=3
     - 2022-10-19T16:33:59.743348Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:33:59.743514Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0324eb869b493b26cc853da6feb37843ec27c9944cbc06bf7d612efcee61176d peer=2
     - 2022-10-19T16:33:59.770223Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:33:59.770429Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 7a69577738e6151784dc66530cd9dc3466d8ef6f5b9267e112766b7aaf95c62c peer=7
     - 2022-10-19T16:33:59.795478Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:33:59.795653Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 7a9d485626ae39b50d6759538d662893325d2197bbdbd9eddb4915b5be44e3b0 peer=0
     - 2022-10-19T16:33:59.819725Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:33:59.819897Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 5c32d19f2d2ba45c41980a4df729cfcde43629e77c5eaf74cf5dfb5ea8355e5b peer=6
     - 2022-10-19T16:33:59.844126Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:33:59.844294Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 7616519ecdee1302ccf29c620faf3e646a010bc7ec5eb80296f871b1a484e70a peer=1
     - 2022-10-19T16:33:59.868194Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:33:59.868366Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 49de72a97770ba40900ff30ce77cc964fe6288087719d1070c71c97fc7e261f9 peer=4
     - 2022-10-19T16:33:59.892402Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:33:59.892577Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 139cffa9d8d84257e055fbcd71ecdd6813ca52f6ac774936d5faeca59bab4deb peer=2
     - 2022-10-19T16:33:59.916548Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:33:59.916723Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0e7b0d5ab8fe14f3e88f55d198cfacb76b80a9e5e5cb4c6af6165cbc136459c3 peer=7
     - 2022-10-19T16:33:59.941073Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:33:59.941248Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 1cd47ffbaeca18483cf019e113955a6d3462e030378fcf1183cd5d7c21abf749 peer=4
     - 2022-10-19T16:33:59.965355Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:33:59.965544Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 39b17af31068085d9eb0c0e9bca61d32fd0772b0e72880314ef514f266953cd9 peer=3
     - 2022-10-19T16:33:59.989803Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:33:59.989985Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 3b48f9f9934113353c763c0e7acb17958bdbbb0bc0ce8a48744d2aab0e917e0d peer=5
     - 2022-10-19T16:34:00.013752Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:34:00.013917Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 4acb6b528b8ae0aec99cdf53cec80e960e01ed6f55170d8f67c1682db4496aed peer=1
     - 2022-10-19T16:34:00.037851Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:34:00.038023Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 1c499198f890caf385f21569107d6c5ab6908c0c1c2bfb7804a0f7cdca1998d7 peer=0
     - 2022-10-19T16:34:00.064995Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:34:00.065266Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 5a0470c7b37492ef17fef835a90a508d195da1f94778eb58f819cbdd2e679a50 peer=6
     - 2022-10-19T16:34:00.090537Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:34:00.090711Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 148736dab33ceac12f53e6c77531ab4974aeecfc0573bce66104886b25b5c3a8 peer=5
     - 2022-10-19T16:34:00.114267Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:34:00.114432Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 599f0eefa6f2427de544d0c0f2d606c2cd38f367ae2b8df0462aaa970470a529 peer=3
     - 2022-10-19T16:34:00.138960Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:34:00.139141Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 3e7c9446183261888c02143e261307483bfe253cb0c35b481dd5b6eeed155aec peer=2
     - 2022-10-19T16:34:00.163578Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:34:00.163741Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 7ebfb4ed5d171579d32caadc83247fe62641ad0bcdeae3546f4b299ce2384f99 peer=1
     - 2022-10-19T16:34:00.187471Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:34:00.187646Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0a6de82d0e73df6aeeabda7905c3e8a76ac6354856aaa0adfe1a06bf9011d1d5 peer=6
     - 2022-10-19T16:34:00.212043Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:34:00.212213Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 449b2e9ac551d16e0ad61cafb858fa0964d4afb16add46a2a3db36322d256dea peer=4
     - 2022-10-19T16:34:00.235885Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:34:00.236067Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 4d14c1a17a8d17642a279960e9f67fe8f43f9c934a8a20abbecc334def4f6f63 peer=7
     - 2022-10-19T16:34:00.259926Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:34:00.260097Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 03e7cc131f6cf0dddbc484d09d102403fc7e34c1fd39d925b592d6560984231e peer=0
     - 2022-10-19T16:34:00.284742Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:34:00.284912Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 14a08a465e344aa6422c4ab8f318d35f3b13aecc33973aa8ae706f5581f64895 peer=2
     - 2022-10-19T16:34:00.309434Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:34:00.309622Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 093853d8fcbce5ab96f51397d9c23e18aa6be55a139c8e067a4b8ae049142b1c peer=3
     - 2022-10-19T16:34:00.333272Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:34:00.333438Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 39b9d7e4b9e98aad90142bb948d7d722446445f3d7085b1b022c9bc78656ba79 peer=5
     - 2022-10-19T16:34:00.357753Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:34:00.357934Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 30bcffd1cebd30eb70935fda6617309ecb706a2cdecd62a9e454d75eec503c01 peer=6
     - 2022-10-19T16:34:00.382750Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:34:00.382919Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0f2fc7fa7578caf56a71611c5c8b33a59d284d598d4dc4a8f963bf8cbe157ee0 peer=0
     - 2022-10-19T16:34:00.406745Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:34:00.406917Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 1b40884fb47a2537c61c7504770a5a7f13a806a37c3a876ea1e283e870746e27 peer=1
     - 2022-10-19T16:34:00.432092Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:34:00.432285Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 6623b5f52d96b06561aa098a065072c14faa6f2c9bb4f655c68cf0699ebd8a89 peer=7
     - 2022-10-19T16:34:00.456746Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:34:00.456921Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 684d2052d2cef9fe217c0218bf5eac19145d29656d8cae07158d0aedf3ac3b01 peer=4
     - 2022-10-19T16:34:00.481389Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:34:00.481578Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 6ae8c6448d3a3da973cb3a019d46d0fc945c6c68ce78b84757819beb02ff273a peer=1
     - 2022-10-19T16:34:00.505301Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:34:00.505472Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 6bd8dcf0649d38a5344626e114cf5a19f930b98236802199ba7fe0ee56f1c874 peer=4
     - 2022-10-19T16:34:00.529028Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:34:00.529208Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 4fdb5209ad3d5541aa4575c08592cc083b6fa891c8ab5c7906b764434d7bbb87 peer=7
     - 2022-10-19T16:34:00.552961Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:34:00.553131Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 273ebb6defbc1ab2c1a127bf354b287871caf8bdbf8b3aa733248870a1e0932d peer=5
     - 2022-10-19T16:34:00.577108Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:34:00.577277Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 71375bb705e9f500c34e40b292243d5af96a1c8ed274a41318d70877d3bc3a8e peer=3
     - 2022-10-19T16:34:00.601089Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:34:00.601261Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 2810c2d616e6eadd8cdafd2fa9619721f931ce0878cd9480ee2a7b3b8c2edd12 peer=6
     - 2022-10-19T16:34:00.625230Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:34:00.625404Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 22edc1556fac5bb72bd5956e99c4ba97cdba9944b456b0635464450da8a83065 peer=2
     - 2022-10-19T16:34:00.650258Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:34:00.650438Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 68f4ecf10bc6b7870cd29b08bf36aa4f02a12a88a874723bdb528d3916618c20 peer=0
     - 2022-10-19T16:34:00.674750Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:34:00.674920Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 3765435e040bf76c4e5d8ed8d5062a92f79b92ce926df4cdc2f412e726f068bc peer=1
     - 2022-10-19T16:34:00.699397Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=0
     - 2022-10-19T16:34:00.699567Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 2ad2276f3bbb64885dc403e0cbd2c04373c4ce6684b69513ab92ee2ab730ecee peer=0
     - 2022-10-19T16:34:00.722971Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:34:00.723138Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 46fd517915787dcb01c74ce4bca206430256f838e1fe798b7e3032abaf62668c peer=7
     - 2022-10-19T16:34:00.747254Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:34:00.747423Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 5139ec71264f1ace89cff79bde6b9c15140958415c048bc81d028c2e29e1680f peer=6
     - 2022-10-19T16:34:00.773599Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=3
     - 2022-10-19T16:34:00.773799Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 7049cf0ed83b30c32dc2d45e470dc01166182d11b9688895b51ee68b548aa984 peer=3
     - 2022-10-19T16:34:00.798494Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:34:00.798796Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 41b4aeba3f712952e794994f658cc7f10b435fda2be817c90cba612d5d51b926 peer=5
     - 2022-10-19T16:34:00.822235Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=4
     - 2022-10-19T16:34:00.822407Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0092e6bbd325d4a53ebf8bef982ee9ca265fcf954ccc45b5dfc0982bdf98f0fc peer=4
     - 2022-10-19T16:34:00.845863Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:34:00.846027Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 425400fdf59d978678caf140ba07c6d0f4c174c8cb6eff98d95ee04b88553bbe peer=2
     - 2022-10-19T16:34:00.869512Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=1
     - 2022-10-19T16:34:00.869736Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 139b69e04292b866d094aa1a45ec04bf194406f0cc3c3d8645ed6cd181e3124c peer=1
     - 2022-10-19T16:34:00.894308Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=7
     - 2022-10-19T16:34:00.894496Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 7015b275815789410e79481152125f3a8d749919012e20539cc2136c709f00bf peer=7
     - 2022-10-19T16:34:00.918507Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=6
     - 2022-10-19T16:34:00.918674Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0929efd49dc5f1cc048dff800a7711b065602fbb37090a478c950fa665643eae peer=6
     - 2022-10-19T16:34:00.943204Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=2
     - 2022-10-19T16:34:00.943384Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 115803feb52a86fb26558fa9e9b62db3565da69e59b4c2744d7029b1f4dae86f peer=2
     - 2022-10-19T16:34:00.967650Z [net_processing.cpp:2806] [ProcessMessage] [net] received: block (145 bytes) peer=5
     - 2022-10-19T16:34:00.967817Z [net_processing.cpp:4078] [ProcessMessage] [net] received block 0c89f3b8fb84a718d3aa4f030f9ae2b5b550a411f4fdbbe599bf8f122a4719eb peer=5
    
    
    2022-10-19T16:34:01.032000Z TestFramework (INFO): Stopping nodes
    2022-10-19T16:34:01.248000Z TestFramework (WARNING): Not cleaning up dir /tmp/bitcoin_func_test_6vaknmm8
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/bitcoin_func_test_6vaknmm8/test_framework.log
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): 
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): Hint: Call .../test/functional/combine_logs.py '/tmp/bitcoin_func_test_6vaknmm8' to consolidate all logs
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): 
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
    2022-10-19T16:34:01.249000Z TestFramework (ERROR): 
    

    </details>

  52. in src/net_processing.cpp:1738 in 4b0dbc0f3e outdated
    1729 | @@ -1723,6 +1730,15 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1730 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1731 |          }
    1732 |      }
    1733 | +    auto stalling_timeout = m_block_stalling_timeout.load();
    1734 | +
    1735 | +    // In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
    1736 | +    if (stalling_timeout > BLOCK_STALLING_TIMEOUT_DEFAULT) {
    1737 | +        const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds>(stalling_timeout * 0.85), BLOCK_STALLING_TIMEOUT_DEFAULT);
    1738 | +        if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
    


    vasild commented at 4:52 PM on October 19, 2022:

    I think stalling_timeout != new_timeout will always be true?


    mzumsande commented at 8:26 PM on October 24, 2022:

    I agree, it became unnecessary after moving the code into the if statement. Removed.

  53. in test/functional/p2p_ibd_stalling.py:97 in 4b0dbc0f3e outdated
      92 | +
      93 | +        self.log.info("Check that the stalling peer is disconnected after 2 seconds")
      94 | +        self.mocktime = int(time.time()) + 3
      95 | +        node.setmocktime(self.mocktime)
      96 | +        peers[0].wait_for_disconnect()
      97 | +        assert_equal(len(self.nodes[0].getpeerinfo()), 7)
    


    vasild commented at 1:11 PM on October 20, 2022:

    nit:

            assert_equal(self.nodes[0].num_test_p2p_connections(), 7)
    

    mzumsande commented at 10:12 PM on October 21, 2022:

    done everywhere

  54. in test/functional/p2p_ibd_stalling.py:55 in 4b0dbc0f3e outdated
      52 | +        self.setup_clean_chain = True
      53 | +        self.num_nodes = 1
      54 | +
      55 | +    def run_test(self):
      56 | +        NUM_BLOCKS = 1025
      57 | +        node = self.nodes[0]
    


    vasild commented at 1:12 PM on October 20, 2022:

    nit: even with this variable self.nodes[0] is used in some places below. I think it is better to use just one of them.


    mzumsande commented at 10:13 PM on October 21, 2022:

    Used node everywhere.

  55. vasild commented at 1:15 PM on October 20, 2022: contributor

    The test takes 1m45s on my laptop

    That was when compiled with TSAN. Normal debug build takes only about 30 seconds. It still fails. There are some problems with the test:

    • First test (should not stall): those received: block... messages in the log above (from the failure of the second test) are produced from the first test. The sleep(0.5) was apparently not enough, so I fixed it to sleep(5) (just for testing, not to actually have it in the final test). This means that in practice it could stall and remain undetected by the first test because it will be happy to not see Stall started in the log even though it may be printed shortly after the first test has eagerly declared success. We want to check that there are 1023 received: block messages in the log and that afterwards the stalling logic from SendMessages() is executed and after that there is no "Stall started" in the log. I am not sure how to do that. Checking the bytes received for block messages seems to be better than the sleep, but could still end the wait too early.

    • Second test (should stall): it fails because there is no "Stall started peer=0" message. I added sleep(10) at the end of the with... block to wait even more for the stall. Then it fails with this error:

    AssertionError: [node 0] Expected messages "['Stall started peer=0']" does not partially match log:
    
     - 2022-10-20T12:16:52.334202Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=1
     - 2022-10-20T12:16:55.712397Z [validation.cpp:3686] [ProcessNewBlockHeaders] Synchronizing blockheaders, height: 1025 (~0.17%)
     - 2022-10-20T12:16:55.712923Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=8
     - 2022-10-20T12:16:59.078114Z [net_processing.cpp:5328] [SendMessages] [net] Stall started peer=1
     - 2022-10-20T12:16:59.078214Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=4
    

    There is a Stall started peer=1 message but it comes 7 seconds after received: headers and is for a different peer. Maybe we should instead wait for the message to appear with wait_for_debug_log() and omit the peer=0 part.

  56. vasild commented at 1:27 PM on October 20, 2022: contributor

    I am ok to drop the test. It is good to have tests to ensure the code works as intended. But we can't have tests for everything and there is a subjective threshold somewhere. If it is too difficult to implement properly or is more complicated than the actual code it tests, then it may be too expensive. There is a maintenance cost for the test too. Developers could waste precious time investigating a sporadically failing test, fixing it or trying to figure out whether their (seemingly unrelated) changes broke the test. I am not saying to drop the test, just that I would be ok with that.

  57. mzumsande force-pushed on Oct 21, 2022
  58. mzumsande commented at 10:26 PM on October 21, 2022: contributor

    Addressed the test feedback (will get to the outstanding comment for the main commit a bit later).

    • First test (should not stall): those received: block... messages in the log above (from the failure of the second test) are produced from the first test. The sleep(0.5) was apparently not enough, so I fixed it to sleep(5) (just for testing, not to actually have it in the final test). This means that in practice it could stall and remain undetected by the first test because it will be happy to not see Stall started in the log even though it may be printed shortly after the first test has eagerly declared success. We want to check that there are 1023 received: block messages in the log and that afterwards the stalling logic from SendMessages() is executed and after that there is no "Stall started" in the log. I am not sure how to do that. Checking the bytes received for block messages seems to be better than the sleep, but could still end the wait too early.

    I rewrote the test such that it doesn't use the log anymore, but waits until all blocks are received, syncs (so that a peer could get marked as a staller), waits for 3s, syncs again (so that a peer could get disconnected), and then checks that no peer gets disconnected.

    • Second test (should stall): it fails because there is no "Stall started peer=0" message. I added sleep(10) at the end of the with... block to wait even more for the stall. Then it fails with this error:

    I removed the peer=0 part of the check and added a missing self.all_sync_send_with_ping(peers) to the with block. With that, the test succeeds for me even with some slow sanitizers enabled - will do more runs over the weekend to check for intermittent failures.

    I am ok to drop the test. It is good to have tests to ensure the code works as intended. But we can't have tests for everything and there is a subjective threshold somewhere. If it is too difficult to implement properly or is more complicated than the actual code it tests, then it may be too expensive. There is a maintenance cost for the test too. Developers could waste precious time investigating a sporadically failing test, fixing it or trying to figure out whether their (seemingly unrelated) changes broke the test. I am not saying to drop the test, just that I would be ok with that.

    If everyone agrees, that would be ok with me. However, the stalling logic was completely untested before, which is not ideal, so the test doesn't just cover the changes from this PR. @dergoegge do you have an opinion, since you suggested the test? Do you think that the stalling logic could maybe better be covered by a unit test after #25515 (which would have fewer problems with timeouts)?

  59. p2p: Make block stalling timeout adaptive
    This makes the stalling detection mechanism (previously a fixed
    timeout of 2s) adaptive:
    If we disconnect a peer for stalling, double the timeout for the
    next peer - and let it slowly relax back to its default
    value each time the tip advances. (Idea by Pieter Wuille)
    
    This makes situations more unlikely in which we'd keep on
    disconnecting many of our peers for stalling, even though our
    own bandwidth is insufficient to download a block in 2 seconds.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    0565951f34
  60. mzumsande force-pushed on Oct 24, 2022
  61. mzumsande commented at 8:29 PM on October 24, 2022: contributor

    9339230 to aceff9e: Also addressed the outstanding comments to the main commit (plus minor reformatting of comments) and fixed another source of spurious test failures - thanks for the reviews!

  62. mzumsande force-pushed on Oct 25, 2022
  63. in test/functional/p2p_ibd_stalling.py:33 in aceff9ed71 outdated
      28 | +)
      29 | +
      30 | +
      31 | +class P2PStaller(P2PDataStore):
      32 | +    def __init__(self, stall_block):
      33 | +        self.is_stalling = True
    


    vasild commented at 6:38 AM on October 25, 2022:

    self.is_stalling is set to True and never changed, I guess it can be removed?


    mzumsande commented at 7:08 PM on October 27, 2022:

    Removed

  64. in test/functional/p2p_ibd_stalling.py:168 in aceff9ed71 outdated
     163 | +
     164 | +    def total_blocks_sent(self, peers):
     165 | +        num_blocks = 0
     166 | +        for p in peers:
     167 | +            num_blocks += p.blocks_sent
     168 | +        return num_blocks
    


    vasild commented at 6:57 AM on October 25, 2022:

    total_blocks_sent() is now unused, can be removed.


    mzumsande commented at 7:08 PM on October 27, 2022:

    yes, and blocks_sent in P2PStaller as well. Done.

  65. in test/functional/p2p_ibd_stalling.py:92 in aceff9ed71 outdated
      87 | +        self.all_sync_send_with_ping(peers)
      88 | +        # If there was a peer marked for stalling, it would get disconnected
      89 | +        self.mocktime = int(time.time()) + 3
      90 | +        node.setmocktime(self.mocktime)
      91 | +        self.all_sync_send_with_ping(peers)
      92 | +        assert_equal(node.num_test_p2p_connections(), 8)
    


    vasild commented at 7:01 AM on October 25, 2022:

    This 8 is the same as the 8 in for id in range(8): above (or len(peers)). Would be better to put it in a constant, similarly to NUM_BLOCKS. Also below, 7, 6 and 5 are used which are that 8 minus 1, 2 or 3.


    mzumsande commented at 7:08 PM on October 27, 2022:

    Introduced NUM_PEERS and used it.

  66. in test/functional/p2p_ibd_stalling.py:78 in aceff9ed71 outdated
      73 | +        headers_message = msg_headers()
      74 | +        headers_message.headers = [CBlockHeader(b) for b in blocks[:NUM_BLOCKS-1]]
      75 | +        peers = []
      76 | +
      77 | +        self.log.info("Check that a staller does not get disconnected if the 1024 block lookahead buffer is filled")
      78 | +        for id in range(8):
    


    vasild commented at 8:17 AM on October 25, 2022:

    The test takes about 1 minute on my laptop. The slowest parts are:

    https://github.com/bitcoin/bitcoin/blob/aceff9ed71348f84b1454bb16dbfc5e804f2aaa4/test/functional/p2p_ibd_stalling.py#L78-L81

    and this

    https://github.com/bitcoin/bitcoin/blob/aceff9ed71348f84b1454bb16dbfc5e804f2aaa4/test/functional/p2p_ibd_stalling.py#L99

    each one takes about 30 seconds.

    Why 8 peers? We disconnect 3, so anything >=3 should be ok? I tried with 4 peers and the test passes, but I could be missing something. With 4 peers it takes about 30 seconds.


    mzumsande commented at 7:19 PM on October 27, 2022:

    The original idea was to have a scenario similar to actual IBD with 8 full outbound peers. But it's obviously not necessary, and since it slows down the test (I didn't experience such a long runtime, for me it was like ~15 seconds) I reduced it to 4 peers.

  67. vasild approved
  68. vasild commented at 8:17 AM on October 25, 2022: contributor

    ACK aceff9ed71348f84b1454bb16dbfc5e804f2aaa4

  69. test: add functional test for IBD stalling logic 39b93649c4
  70. mzumsande force-pushed on Oct 27, 2022
  71. mzumsande commented at 10:10 PM on October 27, 2022: contributor

    aceff9e to 39b9364: addressed feedback by @vasild - the CI failure is unrelated (I opened #26404 to fix it).

  72. vasild approved
  73. vasild commented at 8:15 AM on October 28, 2022: contributor

    ACK 39b93649c4b98cd82c64b957fd9f6a6fd3c2a359

  74. fanquake requested review from naumenkogs on Oct 28, 2022
  75. fanquake requested review from luke-jr on Oct 28, 2022
  76. in src/net_processing.cpp:5256 in 0565951f34 outdated
    5252 |              pto->fDisconnect = true;
    5253 | +            // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
    5254 | +            // bandwidth is insufficient.
    5255 | +            const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
    5256 | +            if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
    5257 | +                LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
    


    naumenkogs commented at 8:28 AM on November 4, 2022:

    nit: why not use new_timeout here in logs?

    1. It has the same meaning, unless there was some crazy concurrency, in which case the log might not make sense anyway....
    2. It is shorter :)
    3. It is probably more efficient
    4. Reading the code is easier

    naumenkogs commented at 8:33 AM on November 4, 2022:

    actually, you do what i suggest in the decreasing code :)


    mzumsande commented at 8:56 PM on January 27, 2023:

    I somehow missed this, sorry, but added it to #26982.

  77. naumenkogs approved
  78. naumenkogs commented at 8:38 AM on November 4, 2022: member

    ACK 39b93649c4b98cd82c64b957fd9f6a6fd3c2a359

  79. achow101 commented at 10:32 PM on January 11, 2023: member

    ACK 39b93649c4b98cd82c64b957fd9f6a6fd3c2a359

  80. RandyMcMillan commented at 10:51 PM on January 11, 2023: contributor

    Strong Concept ACK 39b93649c4b98cd82c64b957fd9f6a6fd3c2a359

    Will do some tests ASAP.

  81. in src/net_processing.cpp:1734 in 39b93649c4
    1729 | @@ -1723,6 +1730,16 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1730 |              m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1731 |          }
    1732 |      }
    1733 | +
    1734 | +    // In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
    


    sipa commented at 10:56 PM on January 11, 2023:

    I'm not sure we should do this every time a block is connected. Whenever a staller got disconnected, and the missing block arrived from another peer, we may suddenly be able to connect dozens of blocks at once. Performing the timeout-reduction step here 16 times suffices to get it back from the maximum 64 to the minimum 2.

    I think it'd be better to drop it just once every time the download window moves, regardless of how much it moved.


    sipa commented at 5:13 AM on January 12, 2023:

    Actually, thinking more about this, I don't think that's ideal either. The window will likely move many times between would-be stalls, even when the stalling timeout has adapted to be close to the "correct" value.

    We should aim to be in a position where the stalling timeout is sort of in an equilibrium between triggering occasionally but not all the time. I think the best way to achieve that is to:

    • Increase it when it triggers due to being too low (reducing the probability of triggering in the future) [implemented]
    • Decrease it when it didn't trigger due to being high enough. And I think we have a way of measuring that: when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached. And by seeing how long it actually took before the stalling state was resolved, we can even do better than just applying a % drop; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.

    achow101 commented at 6:34 PM on January 13, 2023:

    @sipa Do you think that your proposed change should be implemented in this PR or can it be done in a followup? From my perspective, this PR seems to be strictly an improvement even if the stalling timeout backs off too aggressively.


    mzumsande commented at 6:39 PM on January 13, 2023:

    I haven't had time to look into this feedback closely yet - but I am planning to do that next week.


    naumenkogs commented at 7:18 AM on January 16, 2023:

    I think this could be a good follow-up, but anyway, here are my thoughts.

    Decrease it when it didn't trigger due to being high enough.

    I think this is a very good abstract policy.

    when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached.

    This sounds more efficient at doing what it's supposed to do than what's implemented in this PR currently.

    ; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.

    Sounds like a good concrete policy, but not going lower than 2 seconds probably. One could do some math modeling, but I don't think it's that helpful:

    1. with random data, there is no ground truth — one would have to rely on the human sanity check of the inputs, which we already do verbally here;
    2. could be tested against a couple laggy nodes too, comparing between different policies, but eh.

    mzumsande commented at 7:49 PM on January 24, 2023:

    From my perspective, this PR seems to be strictly an improvement even if the stalling timeout backs off too aggressively.

    I'd say that with its current approach, the PR improves behavior in isolated stalling situations, where currently many or all of our peers are repeatedly disconnected without any progress in getting blocks - but a previous stalling situation will not affect behavior in future stalling situations, because all memory of the previous incident is lost after a few blocks. @sipa's suggestion would introduce a long-lasting memory of previous stalling situations:

    • I think that one downside of this approach is that the moving window algorithm should only lead to a stalling situation if one peer is significantly slower than the rest of our peers. Usually we'd want to replace this peer; giving it more time based on previous stalling situations is probably not beneficial, because if it were comparably fast to the other peers, it would not have been marked as a staller in the first place.
    • The upside is in situations where the actual time to download a block for us is significantly larger than 2 seconds - we'd churn through multiple peers / timeout doublings again in every stalling situation right now until we reach the "correct" timeout, but wouldn't anymore with a longer memory.
    • Decrease it when it didn't trigger due to being high enough. And I think we have a way of measuring that: when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached. And by seeing how long it actually took before the stalling state was resolved, we can even do better than just applying a % drop; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.

    I think this would mean moving the decrease of the stalling timeout to ProcessMessages (NetMsgType::BLOCK), where m_stalling_since is currently reset back to 0. At this point we haven't validated the block yet or connected it to the chain, so we'd likely need to at least make sure that we only decrease it after receiving the actual block that allows us to extend our chain (the peer might also have been sending us another block).


    sipa commented at 6:12 PM on January 26, 2023:

    @mzumsande I see. So the point isn't so much that you're trying to build something that measures and converges towards an optimal long-term stalling timeout for your network conditions, but rather that you want something that deliberately gives a temporary "cool down" period after a stalling kick, so it doesn't result in a flurry of disconnects.

    So I think something like my suggestion still makes sense, but it's perhaps an orthogonal thing, and not for this PR.

  82. achow101 merged this on Jan 27, 2023
  83. achow101 closed this on Jan 27, 2023

  84. sidhujag referenced this in commit bbcf59ea73 on Jan 27, 2023
  85. mzumsande deleted the branch on Jan 27, 2023
  86. in test/functional/p2p_ibd_stalling.py:154 in 39b93649c4
     149 | +        return total
     150 | +
     151 | +    def all_sync_send_with_ping(self, peers):
     152 | +        for p in peers:
     153 | +            if p.is_connected:
     154 | +                p.sync_send_with_ping()
    


    maflcko commented at 5:13 PM on January 27, 2023:

    This fails?

    https://cirrus-ci.com/task/4620776167440384?logs=ci#L2630

     test  2023-01-27T16:28:47.291000Z TestFramework.p2p (DEBUG): Received message from 0:0: msg_pong(nonce=0000000f) 
     node0 2023-01-27T16:28:47.340846Z (mocktime: 2023-01-27T16:29:01Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554 
     node0 2023-01-27T16:28:47.347460Z (mocktime: 2023-01-27T16:29:01Z) [httpworker.0] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
     node0 2023-01-27T16:28:47.347940Z (mocktime: 2023-01-27T16:29:01Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554 
     node0 2023-01-27T16:28:47.348962Z (mocktime: 2023-01-27T16:29:01Z) [httpworker.1] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=setmocktime user=__cookie__ 
     node0 2023-01-27T16:28:47.349250Z (mocktime: 2023-01-27T16:29:03Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554 
     node0 2023-01-27T16:28:47.349271Z (mocktime: 2023-01-27T16:29:03Z) [httpworker.2] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
     node0 2023-01-27T16:28:47.393691Z (mocktime: 2023-01-27T16:29:03Z) [msghand] [net_processing.cpp:5738] [SendMessages] Peer=1 is stalling block download, disconnecting 
     node0 2023-01-27T16:28:47.398607Z (mocktime: 2023-01-27T16:29:03Z) [msghand] [net_processing.cpp:5744] [SendMessages] [net] Increased stalling timeout temporarily to 16 seconds 
     node0 2023-01-27T16:28:47.406186Z (mocktime: 2023-01-27T16:29:03Z) [net] [net.cpp:573] [CloseSocketDisconnect] [net] disconnecting peer=1 
     test  2023-01-27T16:28:47.414000Z TestFramework.p2p (DEBUG): Send message to 0:0: msg_ping(nonce=00000010) 
     test  2023-01-27T16:28:47.414000Z TestFramework.p2p (DEBUG): Closed connection to: 0:0 
     node0 2023-01-27T16:28:47.414007Z (mocktime: 2023-01-27T16:29:03Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554 
     node0 2023-01-27T16:28:47.414037Z (mocktime: 2023-01-27T16:29:03Z) [httpworker.3] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
     node0 2023-01-27T16:28:47.414073Z (mocktime: 2023-01-27T16:29:03Z) [net] [net_processing.cpp:1541] [FinalizeNode] [net] Cleared nodestate for peer=1 
     test  2023-01-27T16:28:47.474000Z TestFramework (ERROR): Assertion failed 
                                       Traceback (most recent call last):
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/test_framework.py", line 134, in main
                                           self.run_test()
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/p2p_ibd_stalling.py", line 133, in run_test
                                           self.all_sync_send_with_ping(peers)
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/p2p_ibd_stalling.py", line 154, in all_sync_send_with_ping
                                           p.sync_send_with_ping()
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 560, in sync_send_with_ping
                                           self.sync_with_ping()
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 570, in sync_with_ping
                                           self.wait_until(test_function, timeout=timeout)
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 463, in wait_until
                                           wait_until_helper(test_function, timeout=timeout, lock=p2p_lock, timeout_factor=self.timeout_factor)
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/util.py", line 267, in wait_until_helper
                                           if predicate():
                                         File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 460, in test_function
                                           assert self.is_connected
                                       AssertionError
     test  2023-01-27T16:28:47.489000Z TestFramework (DEBUG): Closing down network thread 
    

    mzumsande commented at 6:50 PM on January 27, 2023:

    looks like the p2p instance has been disconnected by bitcoind, but python hasn't received the callback yet, so it attempts to send a ping in between these events. I think counting the nodes with is_connected instead of using num_test_p2p_connections() will fix this. I'll open a PR.


    mzumsande commented at 8:55 PM on January 27, 2023:

    fixed in #26982

  87. maflcko referenced this in commit 37fea41bbf on Jan 30, 2023
  88. sidhujag referenced this in commit 0f4f31a4f0 on Jan 30, 2023
  89. Fabcien referenced this in commit 3a33b4bd41 on Jan 4, 2024
  90. bitcoin locked this on Jan 27, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-17 03:13 UTC
