Connect to a new outbound peer if our tip is stale

sdaftuar commented at 7:44 pm on October 25, 2017: member

This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

sdaftuar commented at 7:44 pm on October 25, 2017: member

I’m still working on a unit test for this, but wanted to get some eyes on it for consideration in 0.15.0.2.

in src/net_processing.cpp:2958 in a64bbc03dd outdated

2953+        // If we have more outbound peers than we target, disconnect one.
2954+        // Pick the outbound peer that least recently announced
2955+        // us a new block, with ties broken by choosing the more recent
2956+        // connection (higher node id)
2957+        NodeId worst_peer = -1;
2958+        int64_t oldest_block_announcement = time_in_seconds;

gmaxwell commented at 7:50 pm on October 25, 2017:

I’d prefer to see this be a int_64 max, so I don’t need to reason about how monotone our clocks aren’t.

sdaftuar commented at 7:56 pm on October 25, 2017:

Done

fanquake added the label P2P on Oct 25, 2017

fanquake added this to the milestone 0.15.0.2 on Oct 25, 2017

MarcoFalke added the label Needs backport on Oct 25, 2017

gmaxwell commented at 1:17 am on October 26, 2017: contributor

Remind me how the scheduler works… when blocks are slow is this going to result in all hosts on the network triggering their extra peer at basically the same time? Or do we not need to insert a bit of randomness in order to spread them out over the checking interval?

sdaftuar commented at 3:58 pm on October 26, 2017: member

Regarding the scheduler, I mentioned this on IRC to @gmaxwell:

gmaxwell: my understanding of the scheduler is that it’ll start scheduling callbacks at some point after startup, so i wouldn’t expect the network to be completely synced gmaxwell: also there’s random drift, since the scheduler schedules the next callback N milliseconds after the prior one finishes but worth discussing whether the interval is short enough that there still might be too much concentration of everyone trying to find a new peer

I updated this PR with a unit test.

laanwj added this to the "Review priority 0.15.0.2" column in a project

sdaftuar force-pushed on Oct 26, 2017

sdaftuar commented at 8:11 pm on October 26, 2017: member

Needed rebase after #11490.

in src/net.h:263 in 89412654c6 outdated

258+    void SetTryNewOutboundPeer(bool flag);
259+    bool GetTryNewOutboundPeer();
260+
261+    // Return the number of outbound peers we have in excess of our target (eg,
262+    // if we previously called SetTryNewOutboundPeer(true), and have since set
263+    // to false, we may have extra peers that we wish to disconnect).

TheBlueMatt commented at 10:30 pm on October 26, 2017:

You may want to note the (IMO) critical part of the cpp comment: “Also exclude peers that haven’t finished initial connection handshake yet” ie “this may return a value less than the number of outbound connections - the number of normal outbound connection slots in cases where some outbound connections are not yet fully connected”.

in src/net.cpp:1071 in 89412654c6 outdated

1067@@ -1068,7 +1068,7 @@ void CConnman::AcceptConnection(const ListenSocket& hListenSocket) {
1068     SOCKET hSocket = accept(hListenSocket.socket, (struct sockaddr*)&sockaddr, &len);
1069     CAddress addr;
1070     int nInbound = 0;
1071-    int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler);
1072+    int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler + nMaxExtraOutbound);

TheBlueMatt commented at 10:55 pm on October 26, 2017:

I’m confused - it looks like we wont ever have both a feeler and an extra outbound open at one time, why change this?

sdaftuar commented at 6:32 pm on October 27, 2017:

it looks like we wont ever have both a feeler and an extra outbound at one time

We will in general have both a feeler and an extra outbound at the same time. There was an incorrect comment (from a first iteration of this patch) describing the extra outbound peer as “stealing” the feeler connection, but that is not actually true. I’ll delete that comment.

TheBlueMatt commented at 10:56 pm on October 26, 2017: member

Didnt get very far yet, but given deadlines, figured I’d post early and often :p.

in src/net.cpp:1825 in 89412654c6 outdated

1821+
1822+        bool extra_outbound = GetTryNewOutboundPeer();
1823+
1824+        int desired_outbound = nMaxOutbound + (extra_outbound ? nMaxExtraOutbound : 0);
1825+
1826+        if (nOutbound >= desired_outbound) {

TheBlueMatt commented at 6:17 pm on October 27, 2017:

Care to add a comment here (or in the comment block above) noting that we will not make a new connection if we have our “extra slot” used by either a feeler or an “extra outbound”, both are not allowed.

sdaftuar commented at 9:36 pm on October 27, 2017:

Not sure if this comment was made before we chatted offline, but it is possible to have both a feeler and an “extra” outbound peer simultaneously (just not in the same loop iteration, but that’s nonsensical anyway since we only initiate a single connection in a single loop iteration, regardless).

in src/net.h:430 in 89412654c6 outdated

425@@ -413,6 +426,10 @@ class CConnman
426     std::thread threadOpenAddedConnections;
427     std::thread threadOpenConnections;
428     std::thread threadMessageHandler;
429+
430+    /** flag for stealing a feeler connection to find a new outbound peer */

sdaftuar commented at 6:25 pm on October 27, 2017:

whoops, this comment is wrong, will fix.

in src/net_processing.cpp:2403 in b258e7cefd outdated

2398@@ -2382,6 +2399,10 @@ bool static ProcessMessage(CNode* pfrom, const std::string& strCommand, CDataStr
2399         // because it is set in UpdateBlockAvailability. Some nullptr checks
2400         // are still present, however, as belt-and-suspenders.
2401 
2402+        if (received_new_header && pindexLast->nChainWork > chainActive.Tip()->nChainWork) {
2403+            nodestate->m_last_block_announcement = GetTime();

TheBlueMatt commented at 6:39 pm on October 27, 2017:

Shouldn’t there be an equivalent addition for compact block announcements?

sdaftuar commented at 9:22 pm on October 27, 2017:

oops, yes.

in src/net_processing.cpp:447 in c6fac39488 outdated

442+        g_last_tip_update = GetTime();
443+    }
444+    return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty();
445+}
446+
447+// Requires cs_main

TheBlueMatt commented at 6:55 pm on October 27, 2017:

Please no more locking documentation - use the self-documenting AssertLockHeld or GUARDED_BY.

sdaftuar commented at 9:24 pm on October 27, 2017:

Sounds good

sdaftuar commented at 7:03 pm on October 27, 2017: member

Pushed up a few small fixups.

in src/net_processing.cpp:2986 in c6fac39488 outdated

2981+        if (worst_peer != -1) {
2982+            connman->ForNode(worst_peer, [&](CNode *pnode) {
2983+                // Only disconnect a peer that has been connected to us for
2984+                // some reasonable fraction of our check-frequency, to give
2985+                // it time for new information to have arrived.
2986+                if (time_in_seconds - pnode->nTimeConnected > STALE_CHECK_INTERVAL/2) {

TheBlueMatt commented at 7:08 pm on October 27, 2017:

Shouldnt this almost always be true (because we only call this function once every STALE_CHECK_INTERVAL)? Also, I think we can turn this down way more than 5 minutes, if we think we’re behind, and we connect to someone, it shouldn’t take them 5 minutes to send us a header, and if it does, we probably weren’t behind and our existing peers are fine, so we should just disconnect them.

sdaftuar commented at 9:24 pm on October 27, 2017:

Maybe almost always, but definitely not always – our outbound peers can initiate disconnects, too, and it might take us a while to find one.

I’ve pushed up changes to split the peer eviction onto a shorter timer (2.5 minutes), where we only require 30 seconds of connectivity to disconnect.

TheBlueMatt commented at 7:10 pm on October 27, 2017: member

Didnt review test changes yet.

sdaftuar commented at 8:43 pm on October 27, 2017: member

@TheBlueMatt raised concerns about holding the 9th connection for too long. One concern would be if we are reducing the number of available inbound slots on the network by too much. Another potentially more serious concern is that if we’re holding the 9th connection long enough for a new block to be found and relayed, then over time we’ll be selecting for peers with the fastest connectivity.

In the extreme, you could imagine that this algorithm would reduce to nodes grinding away, trying to connect to the peer that was closest to the miners (or the FIBRE network or similar), which is probably not what we want.

In this case, we only intend for the 9th connection to tell us whether we’re really behind (eg because our existing peers are all broken) or if the network is just slow. We can determine that very quickly after connection, so I think I’m going to change this so that we disconnect extra peers more quickly (ie on a faster timer than every 10 minutes, and requiring a shorter connection time), and reset the TryExtraOutboundPeer flag in CConnman after disconnecting an extra peer, essentially waiting until we decide at the next stale-tip-check that we’re still stale, before trying another 9th peer.

sdaftuar commented at 9:36 pm on October 27, 2017: member

Addressed @TheBlueMatt’s nits.

in src/net.cpp:1713 in 89412654c6 outdated

1709+// Exclude peers that are marked for disconnect, or are going to be
1710+// disconnected soon (eg one-shots and feelers)
1711+// Also exclude peers that haven't finished initial connection handshake yet
1712+// (so that we don't decide we're over our desired connection limit, and then
1713+// evict some peer that has finished the handshake)
1714+int CConnman::GetExtraOutboundCount()

theuni commented at 0:12 am on October 28, 2017:

I think this might be greatly simplified by extending CSemaphore to have an adjustable threshold.

sdaftuar commented at 3:11 pm on October 28, 2017:

Any thoughts on @TheBlueMatt’s suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?

https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/?msg=92828477&page=2

I’m leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn’t change though.

sipa commented at 6:22 pm on October 28, 2017: member

Needs rebase.

sdaftuar force-pushed on Oct 28, 2017

sdaftuar commented at 7:26 pm on October 28, 2017: member

Squashed and rebased.

theuni commented at 7:50 pm on October 28, 2017: member

Without seeing the code, I think that would probably be simpler and more obvious.

I wonder if we could actually combine the behavior with feelers. Launch a feeler every 20 min or so as usual, and rather than disconnecting immediately, replace a worse connection if necessary, and upgrade from feeler to full outbound.

On Oct 28, 2017 11:11 AM, “Suhas Daftuar” notifications@github.com wrote:

@sdaftuar commented on this pull request.

In src/net.cpp https://github.com/bitcoin/bitcoin/pull/11560#discussion_r147556553:

return m_try_another_outbound_peer; +}

+void CConnman::SetTryNewOutboundPeer(bool flag) +{

LOCK(m_cs_outbound_peer);

m_try_another_outbound_peer = flag; +}

+// Return the number of peers we have over our outbound connection limit +// Exclude peers that are marked for disconnect, or are going to be +// disconnected soon (eg one-shots and feelers) +// Also exclude peers that haven’t finished initial connection handshake yet +// (so that we don’t decide we’re over our desired connection limit, and then +// evict some peer that has finished the handshake) +int CConnman::GetExtraOutboundCount()

Any thoughts on @TheBlueMatt https://github.com/thebluematt’s suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?

https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/? msg=92828477&page=2

I’m leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn’t change though.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bitcoin/bitcoin/pull/11560#discussion_r147556553, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZdEzhjrlXqpR_iPo_-CIcNtvbYKR8cks5sw0QdgaJpZM4QGjZX .

sdaftuar commented at 1:41 pm on October 29, 2017: member

Fixed a bug and changed the extra peer connection to be exclusive with a feeler, rather than in addition to a feeler.

in src/net_processing.cpp:3095 in f07dc993e7 outdated

3090+            LogPrintf("Potential stale tip detected, will try using extra outbound peer (last tip update: %d seconds ago)\n", time_in_seconds - g_last_tip_update);
3091+            connman->SetTryNewOutboundPeer(true);
3092+        } else {
3093+            connman->SetTryNewOutboundPeer(false);
3094+        }
3095+        m_stale_tip_check_time += STALE_CHECK_INTERVAL;

gmaxwell commented at 5:40 pm on October 29, 2017:

If this loop fails to run when it should– e.g. host it suspended, then later when it does run it will run additional times. I believe this should be = time_in_seconds + STALE_CHECK_INTERVAL.

gmaxwell commented at 5:43 pm on October 29, 2017:

Doubly so because nothing appears to initialize it to a time to begin with…

sdaftuar commented at 5:51 pm on October 29, 2017:

Thanks for catching that; fixed.

in src/net.cpp:1816 in ff5f5cd2b9 outdated

1812@@ -1781,7 +1813,8 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
1813         //  * Only make a feeler connection once every few minutes.
1814         //
1815         bool fFeeler = false;
1816-        if (nOutbound >= nMaxOutbound) {
1817+
1818+        if (nOutbound >= nMaxOutbound && !GetTryNewOutboundPeer()) {

TheBlueMatt commented at 6:00 pm on October 29, 2017:

Doesn’t this lead us to make infinite outbound peers if net_processing isnt agressively marking TryNewOutboundPeer false?

sdaftuar commented at 8:03 pm on October 29, 2017:

Yes, I think if GetTryNewOutboundPeer is set, then we keep trying outbound peers up to the semaphore count, at which point I believe we block. Is that problematic for some reason? I’m not super familiar with the design here but my understanding from reading the code was that this was how this logic was originally written (ie before feelers were introduced, with the continue in the if block to avoid consuming the last semaphore value via a non-feeler).

TheBlueMatt commented at 10:35 pm on October 29, 2017:

Grrr, indeed, I suppose this is correct.

in src/net_processing.cpp:3063 in ff5f5cd2b9 outdated

3058+                // Only disconnect a peer that has been connected to us for
3059+                // some reasonable fraction of our check-frequency, to give
3060+                // it time for new information to have arrived.
3061+                // Also don't disconnect any peer we're trying to download a
3062+                // block from.
3063+                constexpr int64_t min_connect_time = EXTRA_PEER_CHECK_INTERVAL / 5;

TheBlueMatt commented at 10:47 pm on October 29, 2017:

Why check if we need to disconnect every 2.5 minutes and only require it being online 30 seconds? Why not just make both 30 seconds (or maybe 45 and 30) so that we dont always end up holding our extra peer for 2.5 minutes?

sdaftuar commented at 11:19 pm on October 29, 2017:

I’m just not really sure how frequent is too-frequent for scheduled callbacks; there’s a tradeoff between wasting time on calls (especially in this framework, where we always have the callback fire to check for eviction, even if we’ve never called SetExtraOutboundPeer()) and gaining precision with how long we allow the extra peer to be connected, and it’s not clear to me how to figure out the optimal point….

TheBlueMatt commented at 0:12 am on October 30, 2017:

The callback in the connman->GetExtraOutboundCount() == 0 case should be almost free, no?

sdaftuar commented at 4:39 pm on October 30, 2017:

Alright, changed to 45 seconds.

TheBlueMatt commented at 10:49 pm on October 29, 2017: member

utACK ff5f5cd2b9ca9c32200ddb85a7d4510c334576b4 (minus test changes and I still prefer a shorter check interval).

in src/net_processing.h:66 in ff5f5cd2b9 outdated

59@@ -55,6 +60,10 @@ class PeerLogicValidation : public CValidationInterface, public NetEventsInterfa
60     bool SendMessages(CNode* pto, std::atomic<bool>& interrupt) override;
61 
62     void ConsiderEviction(CNode *pto, int64_t time_in_seconds);
63+    void CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams);
64+    void EvictExtraOutboundPeers(int64_t time_in_seconds);
65+
66+    int64_t m_stale_tip_check_time;

jimpo commented at 4:02 pm on October 30, 2017:

Should this be private?

sdaftuar commented at 4:39 pm on October 30, 2017:

Sounds good, done.

sdaftuar force-pushed on Oct 30, 2017

sdaftuar commented at 5:04 pm on October 30, 2017: member

Updated to address a nit from @jimpo, and reduce the check-for-eviction interval down to 45 seconds (as @TheBlueMatt suggested).

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.0 -> 8a27fdf9ae691eb32b8b5e9f531e16266f8227a8

in src/net_processing.cpp:444 in 8a27fdf9ae outdated

439+{
440+    AssertLockHeld(cs_main);
441+    if (g_last_tip_update == 0) {
442+        g_last_tip_update = GetTime();
443+    }
444+    return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty();

achow101 commented at 5:34 pm on October 30, 2017:

Since it is not uncommon for blocks to take more than 30 minutes to be found, maybe this threshold should be higher? It also seems like every node is going to be making that outbound connection at around the same time, maybe there should be some element of randomness to that to avoid some sort of fingerprinting?

sdaftuar commented at 8:01 pm on October 30, 2017:

I think the scheduler behavior will mean that nodes are going to spread out their extra connections over a 10 minute window. Since we already do feeler connections on a (poisson-random) 2 minute interval, and this is taking the place of a feeler, I don’t think this will introduce too much network load.

If there’s a fingerprinting concern, I can try to come up with some fuzzing on when the connection is made (maybe another poisson-random fuzz?), but I’m not sure I understand the specific fingerprinting risk with this logic?

achow101 commented at 0:17 am on October 31, 2017:

but I’m not sure I understand the specific fingerprinting risk with this logic?

I don’t think there is any. I had a fingerprinting concern but then realized that that concern was stupid and stopped being concerned about it.

TheBlueMatt commented at 11:11 pm on October 30, 2017: member

re-utACK 8a27fdf9ae691eb32b8b5e9f531e16266f8227a8

achow101 commented at 0:17 am on October 31, 2017: member

utACK 8a27fdf9ae691eb32b8b5e9f531e16266f8227a8

in src/net_processing.cpp:3048 in 21e8b03a35 outdated

3033+        NodeId worst_peer = -1;
3034+        int64_t oldest_block_announcement = std::numeric_limits<int64_t>::max();
3035+
3036+        LOCK(cs_main);
3037+
3038+        connman->ForEachNode([&](CNode* pnode) {

theuni commented at 3:55 pm on October 31, 2017:

Can we loop through mapNodeState here instead? I’d really like to avoid gluing these layers back together.

Also, I believe that would mean that the nBlocksInFlight != 0 check could be applied here.

sdaftuar commented at 5:18 pm on October 31, 2017:

I just wrote this in my other response, but I don’t think the blocks in flight check is a good thing to sort on – I think there may be edge cases where somehow your slowest peer gets protected, and you would evict someone else.

I’ll see what it looks like when I try to loop mapNodeState instead. I think I’ll still have to call ForNode inside that loop in order to get to the pnode, so I don’t really see the difference in design…? Can you explain a little more what the design goal is for stuff like this?

in src/net_processing.cpp:3068 in 21e8b03a35 outdated

3050+                // some reasonable fraction of our check-frequency, to give
3051+                // it time for new information to have arrived.
3052+                // Also don't disconnect any peer we're trying to download a
3053+                // block from.
3054+                CNodeState &state = *State(pnode->GetId());
3055+                if (time_in_seconds - pnode->nTimeConnected > MINIMUM_CONNECT_TIME && state.nBlocksInFlight == 0) {

theuni commented at 4:08 pm on October 31, 2017:

Doesn’t this mean that if the worst peer is only slightly worse than the second-worst, and the worst has a block in flight, both would stay connected? In that case, I would expect the second-worst to be evicted.

Adding to the comment above, I think that checking for nBlocksInFlight in the worst_peer calculation above avoids this?

sdaftuar commented at 5:09 pm on October 31, 2017:

I don’t think it’s necessary to evict in that situation, since we’ll just try to evict again in 45 seconds, and I thought it was more conservative to not evict. For instance, the block in flight might be at equal work to our tip, rather than more work, and we wouldn’t want to disconnect the older peer in that situation where we’re just downloading an equal-work-chain-tip.

Also if you evict the second-worst peer while a block download is in flight, then it’s possible that the new peer will stall your block download, and you’ll have lost one of your earlier peers unnecessarily.

Anyway I mostly just want to avoid anything that can be gamed, and postponing eviction seems like a safer option than choosing a different peer to evict, I think?

in src/net_processing.cpp:3095 in 21e8b03a35 outdated

3077+{
3078+    if (connman == nullptr) return;
3079+
3080+    int64_t time_in_seconds = GetTime();
3081+
3082+    EvictExtraOutboundPeers(time_in_seconds);

theuni commented at 4:11 pm on October 31, 2017:

Why is this done on a timer rather than making InitializeNode aware that a new node is a replacement?

sdaftuar commented at 4:57 pm on October 31, 2017:

I think it’d be difficult to try to do in InitializeNode, because you don’t know who you are going to want to evict until you’re ready to do the eviction. For example, what happens if one (or all) of your first 8 peers disconnects in between the InitializeNode for your 9th peer, and eviction time?

theuni commented at 11:55 pm on October 31, 2017:

ok, understood

sdaftuar commented at 6:02 pm on October 31, 2017: member

@theuni Pushed a commit that rewrites the ForEachNode() into a loop on mapNodeState.

in src/net_processing.h:36 in 21e8b03a35 outdated

27@@ -27,13 +28,19 @@ static constexpr int64_t HEADERS_DOWNLOAD_TIMEOUT_PER_HEADER = 1000; // 1ms/head
28 static constexpr int32_t MAX_OUTBOUND_PEERS_TO_PROTECT_FROM_DISCONNECT = 4;
29 /** Timeout for (unprotected) outbound peers to sync to our chainwork, in seconds */
30 static constexpr int64_t CHAIN_SYNC_TIMEOUT = 20 * 60; // 20 minutes
31+/** How frequently to check for stale tips, in seconds */
32+static constexpr int64_t STALE_CHECK_INTERVAL = 10 * 60; // 10 minutes
33+/** How frequently to check for extra outbound peers and disconnect, in seconds */
34+static constexpr int64_t EXTRA_PEER_CHECK_INTERVAL = 45;
35+/** Minimum time an outbound-peer-eviction candidate must be connected for, in order to evict, in seconds */
36+static constexpr int64_t MINIMUM_CONNECT_TIME = 30;

theuni commented at 0:03 am on November 1, 2017:

I think we should probably bump this to 60sec, to match the minimum time allowed for the handshake. Or was 30sec a deliberate choice?

That would also give us the handy property that all fSuccessfullyConnected nodes can be disconnected without checking the connection duration.

gmaxwell commented at 0:57 am on November 1, 2017:

disconnected without checking the connection duration.

Huh.. then you’d disconnect someone who completed the handshake fast, without giving them more than a few seconds to transfer blocks.

theuni commented at 1:06 am on November 1, 2017:

Sorry, I meant that in the context of the check here: https://github.com/bitcoin/bitcoin/pull/11560/commits/21e8b03a353abd0f2f4190194f224ad92d6f5bc8#diff-eff7adeaec73a769788bb78858815c91R3055

TheBlueMatt commented at 1:28 pm on November 1, 2017:

I dont really think it matters here - these are separate, we’re seeking a new useful outbound peer, if it doesn’t respond in 30 seconds with a headers message, it probably isnt/wont be useful. Honestly I think we should significantly decrease the headers handshake timeout, there isn’t really a scenario where you should need 60 seconds for a few packets back and forth, even in super lossy environments.

gmaxwell commented at 1:01 am on November 1, 2017: contributor

ACK, I’ve had various versions of this running on a mainnet node for about 5 days now, but with the stall timeout cut down to a fraction of the default value so I could see it in action. Latest versions appear to work as expected. Including managing to keep the latest outbound when a block came over it, dropping instead a different outbound that never got a block.

I’m not too worried about matt’s unbalancing point: this only runs fairly rarely, it runs fairly slowly, and there are 4 protected peers that it won’t unbalance even in the worst case.

One gotcha is that the Turning off message will pretty much never print and is kind of pointless and confusing. Perhaps it should be moved to the inside of settrynewoutbound.

sdaftuar commented at 2:25 pm on November 1, 2017: member

I’m not too worried about matt’s unbalancing point: this only runs fairly rarely, it runs fairly slowly, and there are 4 protected peers that it won’t unbalance even in the worst case.

Actually, the protected-peers logic from the chain sync eviction algorithm wasn’t being used here. After further thought I don’t actually have any good reason not to protect those peers from eviction, and I think the same reasons to protect peers apply here, so I’ve updated this PR accordingly.

I also pushed up a commit that improves the logging.

theuni commented at 2:43 pm on November 1, 2017: member

utACK as-is, but I’d really like to re-work this in master post-merge. We can separate the net vs net_processing states so that they’re not so intertwined.

sdaftuar force-pushed on Nov 1, 2017

sdaftuar commented at 3:40 pm on November 1, 2017: member

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.1 -> dac71cdba95a1b004d2daae322fc026f184ea082

in src/net.h:433 in 45326aaf7a outdated

425@@ -413,6 +426,12 @@ class CConnman
426     std::thread threadOpenAddedConnections;
427     std::thread threadOpenConnections;
428     std::thread threadMessageHandler;
429+
430+    /** flag for deciding to connect to an extra outbound peer,
431+     *  in excess of nMaxOutbound
432+     *  This takes the place of a feeler connection */
433+    CCriticalSection m_cs_outbound_peer;

sipa commented at 3:41 pm on November 1, 2017:

If this CS doesn’t protect more than a single boolean, it’s better to instead just use an std::atomic<bool>.

sdaftuar commented at 3:48 pm on November 1, 2017:

Done

sdaftuar commented at 4:05 pm on November 1, 2017: member

@TheBlueMatt just pointed out to me that ForNode has to walk the vector of CNode’s to find the one we’re looking for, so by rewriting the ForEachNode to a loop on mapNodeState (as suggested by @theuni up here: #11560 (review)), we’re now doing n^2 work instead of n log n work… That seems unfortunate to me – default configurations for listening nodes have > 100 connections right?

I’d like to revert that change.

sdaftuar commented at 5:01 pm on November 1, 2017: member

Pushed a commit (to be squashed) which reverts the loop back to a ForEachNode, rather than walking mapNodeState. Also added a nullptr check on the lookup into mapNodeState, just to be safe.

net: Allow connecting to extra outbound peers 2d4327db19

Track tip update time and last new block announcement from each peer db32a65897

TheBlueMatt commented at 5:23 pm on November 1, 2017: member

re-utACK 79f686b7344b9612123fddec8397a01ab0356713

sdaftuar force-pushed on Nov 1, 2017

sdaftuar commented at 5:56 pm on November 1, 2017: member

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.3 -> 7d9415396b049dd1477febb71930a70870c5a541

theuni commented at 6:23 pm on November 1, 2017: member

utACK 7d94153.

gmaxwell commented at 8:00 pm on November 1, 2017: contributor

re-ut part of my ACK. I’ll move my tests onto this code.

in src/net_processing.cpp:3091 in cfc917ce9d outdated

3077+    }
3078+}
3079+
3080+void PeerLogicValidation::CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams)
3081+{
3082+    if (connman == nullptr) return;

sipa commented at 6:14 am on November 2, 2017:

Nit: this is a slightly easier to reason about if you make the member definition for connman CConnman* const connman; (which compiles without other changes)

sdaftuar commented at 12:31 pm on November 2, 2017:

Sounds good.

in src/net_processing.cpp:3052 in cfc917ce9d outdated

3038+
3039+        connman->ForEachNode([&](CNode* pnode) {
3040+            // Ignore non-outbound peers, or nodes marked for disconnect already
3041+            if (!IsOutboundDisconnectionCandidate(pnode) || pnode->fDisconnect) return;
3042+            CNodeState *state = State(pnode->GetId());
3043+            if (state == nullptr) return; // shouldn't be possible, but just in case

sipa commented at 6:19 am on November 2, 2017:

Make assert?

sdaftuar commented at 12:30 pm on November 2, 2017:

I thought our preference was to not assert except in situations where node shutdown is necessary to avoid eg consensus failure or lost funds or something similarly bad? I’ll add the assert if that’s your preference, but I was just nervous about introducing a crash if I got some race condition wrong.

I could add a LogPrinf() here instead perhaps?

TheBlueMatt commented at 4:11 pm on November 2, 2017:

Agree with @sdaftuar, not only should we not be asserting here, but I want to move away from net_processing relying on specific order-of-operations for deletion of CNode/FinalizeNode in net.

sipa commented at 4:17 pm on November 2, 2017:

@TheBlueMatt Perhaps asserting is too harsh, but silently ignoring the fact that an underlying assumption is broken seems bad too.

Also, it seems very reasonable to assume that net does not invoke any net_processing operations before initializing a node, or after finalizing it? That seems easy to guarantee, and very hard for handlers to deal with if untrue.

TheBlueMatt commented at 4:28 pm on November 2, 2017:

@sipa If we moved mapNodeState outside of cs_main (which we should do) plus take #10697, then this would be a legitimate race that doesn’t matter - if FinalizeNode is called while this loop is running, we may hit this condition and evict our 8th peer because our 9th just disconnected, which may happen either way because the peer may disconnect a millisecond after we evict. This is a perfectly fine thing that isn’t a real issue, but re-introducing the strong assumption about when things are removed from vNodes vs when FinalizeNode is called is needless and annoying.

sipa commented at 4:38 pm on November 2, 2017:

Got it, I withdraw my comment.

sipa commented at 6:25 am on November 2, 2017: member

utACK 7d9415396b049dd1477febb71930a70870c5a541

I did not review the test changes.

Connect to an extra outbound peer if our tip is stale

If our tip hasn't updated in a while, that may be because our peers are
not relaying blocks to us that we would consider valid. Allow connection
to an additional outbound peer in that circumstance.

Also, periodically check to see if we are exceeding our target number of
outbound peers, and disconnect the one which has least recently
announced a new block to us (choosing the newest such peer in the case
of tie).

ac7b37cd2b

Add CConnmanTest to mutate g_connman in tests 83df25736e

Add unit test for stale tip checking 626291508c

sdaftuar force-pushed on Nov 2, 2017

sdaftuar commented at 4:42 pm on November 2, 2017: member

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.4 -> 626291508c433488439b662f2e88882048fb59fb

 0$ git diff 626291508c433488439b662f2e88882048fb59fb 7d9415396b049dd1477febb71930a70870c5a541
 1diff --git a/src/net_processing.h b/src/net_processing.h
 2index 0a49972eed..c1f1274f80 100644
 3--- a/src/net_processing.h
 4+++ b/src/net_processing.h
 5@@ -37,7 +37,7 @@ static constexpr int64_t MINIMUM_CONNECT_TIME = 30;
 6 
 7 class PeerLogicValidation : public CValidationInterface, public NetEventsInterface {
 8 private:
 9-    CConnman* const connman;
10+    CConnman* connman;
11 
12 public:
13     explicit PeerLogicValidation(CConnman* connman, CScheduler &scheduler);

gmaxwell commented at 5:28 pm on November 2, 2017: contributor

re-ACK.

MarcoFalke referenced this in commit b0d0f87160 on Nov 2, 2017

MarcoFalke referenced this in commit ddadae3c27 on Nov 2, 2017

MarcoFalke referenced this in commit f8a4529793 on Nov 2, 2017

MarcoFalke referenced this in commit cf93382d7e on Nov 2, 2017

MarcoFalke referenced this in commit 8c8e512896 on Nov 2, 2017

theuni commented at 6:52 pm on November 2, 2017: member

utACK 626291508c433488439b662f2e88882048fb59fb

laanwj merged this on Nov 2, 2017

laanwj closed this on Nov 2, 2017

laanwj referenced this in commit 2f959a5874 on Nov 2, 2017

MarcoFalke referenced this in commit 49bf090185 on Nov 2, 2017

MarcoFalke referenced this in commit 459f2db425 on Nov 2, 2017

MarcoFalke referenced this in commit a607a95d81 on Nov 2, 2017

MarcoFalke referenced this in commit 2ed0647ac1 on Nov 2, 2017

MarcoFalke referenced this in commit fc308a6cdb on Nov 2, 2017

MarcoFalke removed the label Needs backport on Nov 2, 2017

laanwj removed this from the "Review priority 0.15.0.2" column in a project

ghost commented at 1:20 am on February 24, 2018: none

I had a slow sync issue which may have been related to and/or fixed by this… See #12514

I eventually did get the recompile of the GitHub code but the version # did not change.

I hope my issue/resolution helps in fixing this issue or confirming it is fixed :)

codablock referenced this in commit f544e6a642 on Sep 26, 2019

codablock referenced this in commit 58cb7e38f4 on Sep 29, 2019

barrystyle referenced this in commit 0e69e92f08 on Jan 22, 2020

DrahtBot locked this on Feb 15, 2022

Connect to a new outbound peer if our tip is stale #11560

@sdaftuar commented on this pull request.