Connect to a new outbound peer if our tip is stale #11560

pull sdaftuar wants to merge 5 commits into bitcoin:master from sdaftuar:2017-10-stale-tip-new-peer changing 8 files +313 −8
  1. sdaftuar commented at 7:44 pm on October 25, 2017: member

    This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

    Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

  2. sdaftuar commented at 7:44 pm on October 25, 2017: member
    I’m still working on a unit test for this, but wanted to get some eyes on it for consideration in 0.15.0.2.
  3. in src/net_processing.cpp:2958 in a64bbc03dd outdated
    2953+        // If we have more outbound peers than we target, disconnect one.
    2954+        // Pick the outbound peer that least recently announced
    2955+        // us a new block, with ties broken by choosing the more recent
    2956+        // connection (higher node id)
    2957+        NodeId worst_peer = -1;
    2958+        int64_t oldest_block_announcement = time_in_seconds;
    


    gmaxwell commented at 7:50 pm on October 25, 2017:
    I’d prefer to see this be a int_64 max, so I don’t need to reason about how monotone our clocks aren’t.

    sdaftuar commented at 7:56 pm on October 25, 2017:
    Done
  4. fanquake added the label P2P on Oct 25, 2017
  5. fanquake added this to the milestone 0.15.0.2 on Oct 25, 2017
  6. MarcoFalke added the label Needs backport on Oct 25, 2017
  7. gmaxwell commented at 1:17 am on October 26, 2017: contributor
    Remind me how the scheduler works… when blocks are slow is this going to result in all hosts on the network triggering their extra peer at basically the same time? Or do we not need to insert a bit of randomness in order to spread them out over the checking interval?
  8. sdaftuar commented at 3:58 pm on October 26, 2017: member

    Regarding the scheduler, I mentioned this on IRC to @gmaxwell:

    gmaxwell: my understanding of the scheduler is that it’ll start scheduling callbacks at some point after startup, so i wouldn’t expect the network to be completely synced gmaxwell: also there’s random drift, since the scheduler schedules the next callback N milliseconds after the prior one finishes but worth discussing whether the interval is short enough that there still might be too much concentration of everyone trying to find a new peer

    I updated this PR with a unit test.

  9. laanwj added this to the "Review priority 0.15.0.2" column in a project

  10. sdaftuar force-pushed on Oct 26, 2017
  11. sdaftuar commented at 8:11 pm on October 26, 2017: member
    Needed rebase after #11490.
  12. in src/net.h:263 in 89412654c6 outdated
    258+    void SetTryNewOutboundPeer(bool flag);
    259+    bool GetTryNewOutboundPeer();
    260+
    261+    // Return the number of outbound peers we have in excess of our target (eg,
    262+    // if we previously called SetTryNewOutboundPeer(true), and have since set
    263+    // to false, we may have extra peers that we wish to disconnect).
    


    TheBlueMatt commented at 10:30 pm on October 26, 2017:
    You may want to note the (IMO) critical part of the cpp comment: “Also exclude peers that haven’t finished initial connection handshake yet” ie “this may return a value less than the number of outbound connections - the number of normal outbound connection slots in cases where some outbound connections are not yet fully connected”.
  13. in src/net.cpp:1071 in 89412654c6 outdated
    1067@@ -1068,7 +1068,7 @@ void CConnman::AcceptConnection(const ListenSocket& hListenSocket) {
    1068     SOCKET hSocket = accept(hListenSocket.socket, (struct sockaddr*)&sockaddr, &len);
    1069     CAddress addr;
    1070     int nInbound = 0;
    1071-    int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler);
    1072+    int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler + nMaxExtraOutbound);
    


    TheBlueMatt commented at 10:55 pm on October 26, 2017:
    I’m confused - it looks like we wont ever have both a feeler and an extra outbound open at one time, why change this?

    sdaftuar commented at 6:32 pm on October 27, 2017:

    it looks like we wont ever have both a feeler and an extra outbound at one time

    We will in general have both a feeler and an extra outbound at the same time. There was an incorrect comment (from a first iteration of this patch) describing the extra outbound peer as “stealing” the feeler connection, but that is not actually true. I’ll delete that comment.

  14. TheBlueMatt commented at 10:56 pm on October 26, 2017: member
    Didnt get very far yet, but given deadlines, figured I’d post early and often :p.
  15. in src/net.cpp:1825 in 89412654c6 outdated
    1821+
    1822+        bool extra_outbound = GetTryNewOutboundPeer();
    1823+
    1824+        int desired_outbound = nMaxOutbound + (extra_outbound ? nMaxExtraOutbound : 0);
    1825+
    1826+        if (nOutbound >= desired_outbound) {
    


    TheBlueMatt commented at 6:17 pm on October 27, 2017:
    Care to add a comment here (or in the comment block above) noting that we will not make a new connection if we have our “extra slot” used by either a feeler or an “extra outbound”, both are not allowed.

    sdaftuar commented at 9:36 pm on October 27, 2017:
    Not sure if this comment was made before we chatted offline, but it is possible to have both a feeler and an “extra” outbound peer simultaneously (just not in the same loop iteration, but that’s nonsensical anyway since we only initiate a single connection in a single loop iteration, regardless).
  16. in src/net.h:430 in 89412654c6 outdated
    425@@ -413,6 +426,10 @@ class CConnman
    426     std::thread threadOpenAddedConnections;
    427     std::thread threadOpenConnections;
    428     std::thread threadMessageHandler;
    429+
    430+    /** flag for stealing a feeler connection to find a new outbound peer */
    


    sdaftuar commented at 6:25 pm on October 27, 2017:
    whoops, this comment is wrong, will fix.
  17. in src/net_processing.cpp:2403 in b258e7cefd outdated
    2398@@ -2382,6 +2399,10 @@ bool static ProcessMessage(CNode* pfrom, const std::string& strCommand, CDataStr
    2399         // because it is set in UpdateBlockAvailability. Some nullptr checks
    2400         // are still present, however, as belt-and-suspenders.
    2401 
    2402+        if (received_new_header && pindexLast->nChainWork > chainActive.Tip()->nChainWork) {
    2403+            nodestate->m_last_block_announcement = GetTime();
    


    TheBlueMatt commented at 6:39 pm on October 27, 2017:
    Shouldn’t there be an equivalent addition for compact block announcements?

    sdaftuar commented at 9:22 pm on October 27, 2017:
    oops, yes.
  18. in src/net_processing.cpp:447 in c6fac39488 outdated
    442+        g_last_tip_update = GetTime();
    443+    }
    444+    return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty();
    445+}
    446+
    447+// Requires cs_main
    


    TheBlueMatt commented at 6:55 pm on October 27, 2017:
    Please no more locking documentation - use the self-documenting AssertLockHeld or GUARDED_BY.

    sdaftuar commented at 9:24 pm on October 27, 2017:
    Sounds good
  19. sdaftuar commented at 7:03 pm on October 27, 2017: member
    Pushed up a few small fixups.
  20. in src/net_processing.cpp:2986 in c6fac39488 outdated
    2981+        if (worst_peer != -1) {
    2982+            connman->ForNode(worst_peer, [&](CNode *pnode) {
    2983+                // Only disconnect a peer that has been connected to us for
    2984+                // some reasonable fraction of our check-frequency, to give
    2985+                // it time for new information to have arrived.
    2986+                if (time_in_seconds - pnode->nTimeConnected > STALE_CHECK_INTERVAL/2) {
    


    TheBlueMatt commented at 7:08 pm on October 27, 2017:
    Shouldnt this almost always be true (because we only call this function once every STALE_CHECK_INTERVAL)? Also, I think we can turn this down way more than 5 minutes, if we think we’re behind, and we connect to someone, it shouldn’t take them 5 minutes to send us a header, and if it does, we probably weren’t behind and our existing peers are fine, so we should just disconnect them.

    sdaftuar commented at 9:24 pm on October 27, 2017:

    Maybe almost always, but definitely not always – our outbound peers can initiate disconnects, too, and it might take us a while to find one.

    I’ve pushed up changes to split the peer eviction onto a shorter timer (2.5 minutes), where we only require 30 seconds of connectivity to disconnect.

  21. TheBlueMatt commented at 7:10 pm on October 27, 2017: member
    Didnt review test changes yet.
  22. sdaftuar commented at 8:43 pm on October 27, 2017: member

    @TheBlueMatt raised concerns about holding the 9th connection for too long. One concern would be if we are reducing the number of available inbound slots on the network by too much. Another potentially more serious concern is that if we’re holding the 9th connection long enough for a new block to be found and relayed, then over time we’ll be selecting for peers with the fastest connectivity.

    In the extreme, you could imagine that this algorithm would reduce to nodes grinding away, trying to connect to the peer that was closest to the miners (or the FIBRE network or similar), which is probably not what we want.

    In this case, we only intend for the 9th connection to tell us whether we’re really behind (eg because our existing peers are all broken) or if the network is just slow. We can determine that very quickly after connection, so I think I’m going to change this so that we disconnect extra peers more quickly (ie on a faster timer than every 10 minutes, and requiring a shorter connection time), and reset the TryExtraOutboundPeer flag in CConnman after disconnecting an extra peer, essentially waiting until we decide at the next stale-tip-check that we’re still stale, before trying another 9th peer.

  23. sdaftuar commented at 9:36 pm on October 27, 2017: member
    Addressed @TheBlueMatt’s nits.
  24. in src/net.cpp:1713 in 89412654c6 outdated
    1709+// Exclude peers that are marked for disconnect, or are going to be
    1710+// disconnected soon (eg one-shots and feelers)
    1711+// Also exclude peers that haven't finished initial connection handshake yet
    1712+// (so that we don't decide we're over our desired connection limit, and then
    1713+// evict some peer that has finished the handshake)
    1714+int CConnman::GetExtraOutboundCount()
    


    theuni commented at 0:12 am on October 28, 2017:
    I think this might be greatly simplified by extending CSemaphore to have an adjustable threshold.

    sdaftuar commented at 3:11 pm on October 28, 2017:

    Any thoughts on @TheBlueMatt’s suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?

    https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/?msg=92828477&page=2

    I’m leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn’t change though.

  25. sipa commented at 6:22 pm on October 28, 2017: member
    Needs rebase.
  26. sdaftuar force-pushed on Oct 28, 2017
  27. sdaftuar commented at 7:26 pm on October 28, 2017: member
    Squashed and rebased.
  28. theuni commented at 7:50 pm on October 28, 2017: member

    Without seeing the code, I think that would probably be simpler and more obvious.

    I wonder if we could actually combine the behavior with feelers. Launch a feeler every 20 min or so as usual, and rather than disconnecting immediately, replace a worse connection if necessary, and upgrade from feeler to full outbound.

    On Oct 28, 2017 11:11 AM, “Suhas Daftuar” notifications@github.com wrote:

    @sdaftuar commented on this pull request.

    In src/net.cpp https://github.com/bitcoin/bitcoin/pull/11560#discussion_r147556553:

    • return m_try_another_outbound_peer; +}

    +void CConnman::SetTryNewOutboundPeer(bool flag) +{

    • LOCK(m_cs_outbound_peer);
    • m_try_another_outbound_peer = flag; +}

    +// Return the number of peers we have over our outbound connection limit +// Exclude peers that are marked for disconnect, or are going to be +// disconnected soon (eg one-shots and feelers) +// Also exclude peers that haven’t finished initial connection handshake yet +// (so that we don’t decide we’re over our desired connection limit, and then +// evict some peer that has finished the handshake) +int CConnman::GetExtraOutboundCount()

    Any thoughts on @TheBlueMatt https://github.com/thebluematt’s suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?

    https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/? msg=92828477&page=2

    I’m leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn’t change though.

    — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bitcoin/bitcoin/pull/11560#discussion_r147556553, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZdEzhjrlXqpR_iPo_-CIcNtvbYKR8cks5sw0QdgaJpZM4QGjZX .

  29. sdaftuar commented at 1:41 pm on October 29, 2017: member
    Fixed a bug and changed the extra peer connection to be exclusive with a feeler, rather than in addition to a feeler.
  30. in src/net_processing.cpp:3095 in f07dc993e7 outdated
    3090+            LogPrintf("Potential stale tip detected, will try using extra outbound peer (last tip update: %d seconds ago)\n", time_in_seconds - g_last_tip_update);
    3091+            connman->SetTryNewOutboundPeer(true);
    3092+        } else {
    3093+            connman->SetTryNewOutboundPeer(false);
    3094+        }
    3095+        m_stale_tip_check_time += STALE_CHECK_INTERVAL;
    


    gmaxwell commented at 5:40 pm on October 29, 2017:
    If this loop fails to run when it should– e.g. host it suspended, then later when it does run it will run additional times. I believe this should be = time_in_seconds + STALE_CHECK_INTERVAL.

    gmaxwell commented at 5:43 pm on October 29, 2017:
    Doubly so because nothing appears to initialize it to a time to begin with…

    sdaftuar commented at 5:51 pm on October 29, 2017:
    Thanks for catching that; fixed.
  31. in src/net.cpp:1816 in ff5f5cd2b9 outdated
    1812@@ -1781,7 +1813,8 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
    1813         //  * Only make a feeler connection once every few minutes.
    1814         //
    1815         bool fFeeler = false;
    1816-        if (nOutbound >= nMaxOutbound) {
    1817+
    1818+        if (nOutbound >= nMaxOutbound && !GetTryNewOutboundPeer()) {
    


    TheBlueMatt commented at 6:00 pm on October 29, 2017:
    Doesn’t this lead us to make infinite outbound peers if net_processing isnt agressively marking TryNewOutboundPeer false?

    sdaftuar commented at 8:03 pm on October 29, 2017:
    Yes, I think if GetTryNewOutboundPeer is set, then we keep trying outbound peers up to the semaphore count, at which point I believe we block. Is that problematic for some reason? I’m not super familiar with the design here but my understanding from reading the code was that this was how this logic was originally written (ie before feelers were introduced, with the continue in the if block to avoid consuming the last semaphore value via a non-feeler).

    TheBlueMatt commented at 10:35 pm on October 29, 2017:
    Grrr, indeed, I suppose this is correct.
  32. in src/net_processing.cpp:3063 in ff5f5cd2b9 outdated
    3058+                // Only disconnect a peer that has been connected to us for
    3059+                // some reasonable fraction of our check-frequency, to give
    3060+                // it time for new information to have arrived.
    3061+                // Also don't disconnect any peer we're trying to download a
    3062+                // block from.
    3063+                constexpr int64_t min_connect_time = EXTRA_PEER_CHECK_INTERVAL / 5;
    


    TheBlueMatt commented at 10:47 pm on October 29, 2017:
    Why check if we need to disconnect every 2.5 minutes and only require it being online 30 seconds? Why not just make both 30 seconds (or maybe 45 and 30) so that we dont always end up holding our extra peer for 2.5 minutes?

    sdaftuar commented at 11:19 pm on October 29, 2017:
    I’m just not really sure how frequent is too-frequent for scheduled callbacks; there’s a tradeoff between wasting time on calls (especially in this framework, where we always have the callback fire to check for eviction, even if we’ve never called SetExtraOutboundPeer()) and gaining precision with how long we allow the extra peer to be connected, and it’s not clear to me how to figure out the optimal point….

    TheBlueMatt commented at 0:12 am on October 30, 2017:
    The callback in the connman->GetExtraOutboundCount() == 0 case should be almost free, no?

    sdaftuar commented at 4:39 pm on October 30, 2017:
    Alright, changed to 45 seconds.
  33. TheBlueMatt commented at 10:49 pm on October 29, 2017: member
    utACK ff5f5cd2b9ca9c32200ddb85a7d4510c334576b4 (minus test changes and I still prefer a shorter check interval).
  34. in src/net_processing.h:66 in ff5f5cd2b9 outdated
    59@@ -55,6 +60,10 @@ class PeerLogicValidation : public CValidationInterface, public NetEventsInterfa
    60     bool SendMessages(CNode* pto, std::atomic<bool>& interrupt) override;
    61 
    62     void ConsiderEviction(CNode *pto, int64_t time_in_seconds);
    63+    void CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams);
    64+    void EvictExtraOutboundPeers(int64_t time_in_seconds);
    65+
    66+    int64_t m_stale_tip_check_time;
    


    jimpo commented at 4:02 pm on October 30, 2017:
    Should this be private?

    sdaftuar commented at 4:39 pm on October 30, 2017:
    Sounds good, done.
  35. sdaftuar force-pushed on Oct 30, 2017
  36. sdaftuar commented at 5:04 pm on October 30, 2017: member

    Updated to address a nit from @jimpo, and reduce the check-for-eviction interval down to 45 seconds (as @TheBlueMatt suggested).

    Squashed https://github.com/sdaftuar/bitcoin/commits/11560.0 -> 8a27fdf9ae691eb32b8b5e9f531e16266f8227a8

  37. in src/net_processing.cpp:444 in 8a27fdf9ae outdated
    439+{
    440+    AssertLockHeld(cs_main);
    441+    if (g_last_tip_update == 0) {
    442+        g_last_tip_update = GetTime();
    443+    }
    444+    return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty();
    


    achow101 commented at 5:34 pm on October 30, 2017:
    Since it is not uncommon for blocks to take more than 30 minutes to be found, maybe this threshold should be higher? It also seems like every node is going to be making that outbound connection at around the same time, maybe there should be some element of randomness to that to avoid some sort of fingerprinting?

    sdaftuar commented at 8:01 pm on October 30, 2017:

    I think the scheduler behavior will mean that nodes are going to spread out their extra connections over a 10 minute window. Since we already do feeler connections on a (poisson-random) 2 minute interval, and this is taking the place of a feeler, I don’t think this will introduce too much network load.

    If there’s a fingerprinting concern, I can try to come up with some fuzzing on when the connection is made (maybe another poisson-random fuzz?), but I’m not sure I understand the specific fingerprinting risk with this logic?


    achow101 commented at 0:17 am on October 31, 2017:

    but I’m not sure I understand the specific fingerprinting risk with this logic?

    I don’t think there is any. I had a fingerprinting concern but then realized that that concern was stupid and stopped being concerned about it.

  38. TheBlueMatt commented at 11:11 pm on October 30, 2017: member
    re-utACK 8a27fdf9ae691eb32b8b5e9f531e16266f8227a8
  39. achow101 commented at 0:17 am on October 31, 2017: member
    utACK 8a27fdf9ae691eb32b8b5e9f531e16266f8227a8
  40. in src/net_processing.cpp:3048 in 21e8b03a35 outdated
    3033+        NodeId worst_peer = -1;
    3034+        int64_t oldest_block_announcement = std::numeric_limits<int64_t>::max();
    3035+
    3036+        LOCK(cs_main);
    3037+
    3038+        connman->ForEachNode([&](CNode* pnode) {
    


    theuni commented at 3:55 pm on October 31, 2017:

    Can we loop through mapNodeState here instead? I’d really like to avoid gluing these layers back together.

    Also, I believe that would mean that the nBlocksInFlight != 0 check could be applied here.


    sdaftuar commented at 5:18 pm on October 31, 2017:

    I just wrote this in my other response, but I don’t think the blocks in flight check is a good thing to sort on – I think there may be edge cases where somehow your slowest peer gets protected, and you would evict someone else.

    I’ll see what it looks like when I try to loop mapNodeState instead. I think I’ll still have to call ForNode inside that loop in order to get to the pnode, so I don’t really see the difference in design…? Can you explain a little more what the design goal is for stuff like this?

  41. in src/net_processing.cpp:3068 in 21e8b03a35 outdated
    3050+                // some reasonable fraction of our check-frequency, to give
    3051+                // it time for new information to have arrived.
    3052+                // Also don't disconnect any peer we're trying to download a
    3053+                // block from.
    3054+                CNodeState &state = *State(pnode->GetId());
    3055+                if (time_in_seconds - pnode->nTimeConnected > MINIMUM_CONNECT_TIME && state.nBlocksInFlight == 0) {
    


    theuni commented at 4:08 pm on October 31, 2017:

    Doesn’t this mean that if the worst peer is only slightly worse than the second-worst, and the worst has a block in flight, both would stay connected? In that case, I would expect the second-worst to be evicted.

    Adding to the comment above, I think that checking for nBlocksInFlight in the worst_peer calculation above avoids this?


    sdaftuar commented at 5:09 pm on October 31, 2017:

    I don’t think it’s necessary to evict in that situation, since we’ll just try to evict again in 45 seconds, and I thought it was more conservative to not evict. For instance, the block in flight might be at equal work to our tip, rather than more work, and we wouldn’t want to disconnect the older peer in that situation where we’re just downloading an equal-work-chain-tip.

    Also if you evict the second-worst peer while a block download is in flight, then it’s possible that the new peer will stall your block download, and you’ll have lost one of your earlier peers unnecessarily.

    Anyway I mostly just want to avoid anything that can be gamed, and postponing eviction seems like a safer option than choosing a different peer to evict, I think?

  42. in src/net_processing.cpp:3095 in 21e8b03a35 outdated
    3077+{
    3078+    if (connman == nullptr) return;
    3079+
    3080+    int64_t time_in_seconds = GetTime();
    3081+
    3082+    EvictExtraOutboundPeers(time_in_seconds);
    


    theuni commented at 4:11 pm on October 31, 2017:
    Why is this done on a timer rather than making InitializeNode aware that a new node is a replacement?

    sdaftuar commented at 4:57 pm on October 31, 2017:
    I think it’d be difficult to try to do in InitializeNode, because you don’t know who you are going to want to evict until you’re ready to do the eviction. For example, what happens if one (or all) of your first 8 peers disconnects in between the InitializeNode for your 9th peer, and eviction time?

    theuni commented at 11:55 pm on October 31, 2017:
    ok, understood
  43. sdaftuar commented at 6:02 pm on October 31, 2017: member
    @theuni Pushed a commit that rewrites the ForEachNode() into a loop on mapNodeState.
  44. in src/net_processing.h:36 in 21e8b03a35 outdated
    27@@ -27,13 +28,19 @@ static constexpr int64_t HEADERS_DOWNLOAD_TIMEOUT_PER_HEADER = 1000; // 1ms/head
    28 static constexpr int32_t MAX_OUTBOUND_PEERS_TO_PROTECT_FROM_DISCONNECT = 4;
    29 /** Timeout for (unprotected) outbound peers to sync to our chainwork, in seconds */
    30 static constexpr int64_t CHAIN_SYNC_TIMEOUT = 20 * 60; // 20 minutes
    31+/** How frequently to check for stale tips, in seconds */
    32+static constexpr int64_t STALE_CHECK_INTERVAL = 10 * 60; // 10 minutes
    33+/** How frequently to check for extra outbound peers and disconnect, in seconds */
    34+static constexpr int64_t EXTRA_PEER_CHECK_INTERVAL = 45;
    35+/** Minimum time an outbound-peer-eviction candidate must be connected for, in order to evict, in seconds */
    36+static constexpr int64_t MINIMUM_CONNECT_TIME = 30;
    


    theuni commented at 0:03 am on November 1, 2017:

    I think we should probably bump this to 60sec, to match the minimum time allowed for the handshake. Or was 30sec a deliberate choice?

    That would also give us the handy property that all fSuccessfullyConnected nodes can be disconnected without checking the connection duration.


    gmaxwell commented at 0:57 am on November 1, 2017:

    disconnected without checking the connection duration.

    Huh.. then you’d disconnect someone who completed the handshake fast, without giving them more than a few seconds to transfer blocks.


    theuni commented at 1:06 am on November 1, 2017:

    TheBlueMatt commented at 1:28 pm on November 1, 2017:
    I dont really think it matters here - these are separate, we’re seeking a new useful outbound peer, if it doesn’t respond in 30 seconds with a headers message, it probably isnt/wont be useful. Honestly I think we should significantly decrease the headers handshake timeout, there isn’t really a scenario where you should need 60 seconds for a few packets back and forth, even in super lossy environments.
  45. gmaxwell commented at 1:01 am on November 1, 2017: contributor

    ACK, I’ve had various versions of this running on a mainnet node for about 5 days now, but with the stall timeout cut down to a fraction of the default value so I could see it in action. Latest versions appear to work as expected. Including managing to keep the latest outbound when a block came over it, dropping instead a different outbound that never got a block.

    I’m not too worried about matt’s unbalancing point: this only runs fairly rarely, it runs fairly slowly, and there are 4 protected peers that it won’t unbalance even in the worst case.

    One gotcha is that the Turning off message will pretty much never print and is kind of pointless and confusing. Perhaps it should be moved to the inside of settrynewoutbound.

  46. sdaftuar commented at 2:25 pm on November 1, 2017: member

    I’m not too worried about matt’s unbalancing point: this only runs fairly rarely, it runs fairly slowly, and there are 4 protected peers that it won’t unbalance even in the worst case.

    Actually, the protected-peers logic from the chain sync eviction algorithm wasn’t being used here. After further thought I don’t actually have any good reason not to protect those peers from eviction, and I think the same reasons to protect peers apply here, so I’ve updated this PR accordingly.

    I also pushed up a commit that improves the logging.

  47. theuni commented at 2:43 pm on November 1, 2017: member
    utACK as-is, but I’d really like to re-work this in master post-merge. We can separate the net vs net_processing states so that they’re not so intertwined.
  48. sdaftuar force-pushed on Nov 1, 2017
  49. sdaftuar commented at 3:40 pm on November 1, 2017: member
    Squashed https://github.com/sdaftuar/bitcoin/commits/11560.1 -> dac71cdba95a1b004d2daae322fc026f184ea082
  50. in src/net.h:433 in 45326aaf7a outdated
    425@@ -413,6 +426,12 @@ class CConnman
    426     std::thread threadOpenAddedConnections;
    427     std::thread threadOpenConnections;
    428     std::thread threadMessageHandler;
    429+
    430+    /** flag for deciding to connect to an extra outbound peer,
    431+     *  in excess of nMaxOutbound
    432+     *  This takes the place of a feeler connection */
    433+    CCriticalSection m_cs_outbound_peer;
    


    sipa commented at 3:41 pm on November 1, 2017:
    If this CS doesn’t protect more than a single boolean, it’s better to instead just use an std::atomic<bool>.

    sdaftuar commented at 3:48 pm on November 1, 2017:
    Done
  51. sdaftuar commented at 4:05 pm on November 1, 2017: member

    @TheBlueMatt just pointed out to me that ForNode has to walk the vector of CNode’s to find the one we’re looking for, so by rewriting the ForEachNode to a loop on mapNodeState (as suggested by @theuni up here: #11560 (review)), we’re now doing n^2 work instead of n log n work… That seems unfortunate to me – default configurations for listening nodes have > 100 connections right?

    I’d like to revert that change.

  52. sdaftuar commented at 5:01 pm on November 1, 2017: member
    Pushed a commit (to be squashed) which reverts the loop back to a ForEachNode, rather than walking mapNodeState. Also added a nullptr check on the lookup into mapNodeState, just to be safe.
  53. net: Allow connecting to extra outbound peers 2d4327db19
  54. Track tip update time and last new block announcement from each peer db32a65897
  55. TheBlueMatt commented at 5:23 pm on November 1, 2017: member
    re-utACK 79f686b7344b9612123fddec8397a01ab0356713
  56. sdaftuar force-pushed on Nov 1, 2017
  57. sdaftuar commented at 5:56 pm on November 1, 2017: member
    Squashed https://github.com/sdaftuar/bitcoin/commits/11560.3 -> 7d9415396b049dd1477febb71930a70870c5a541
  58. theuni commented at 6:23 pm on November 1, 2017: member
    utACK 7d94153.
  59. gmaxwell commented at 8:00 pm on November 1, 2017: contributor
    re-ut part of my ACK. I’ll move my tests onto this code.
  60. in src/net_processing.cpp:3091 in cfc917ce9d outdated
    3077+    }
    3078+}
    3079+
    3080+void PeerLogicValidation::CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams)
    3081+{
    3082+    if (connman == nullptr) return;
    


    sipa commented at 6:14 am on November 2, 2017:
    Nit: this is a slightly easier to reason about if you make the member definition for connman CConnman* const connman; (which compiles without other changes)

    sdaftuar commented at 12:31 pm on November 2, 2017:
    Sounds good.
  61. in src/net_processing.cpp:3052 in cfc917ce9d outdated
    3038+
    3039+        connman->ForEachNode([&](CNode* pnode) {
    3040+            // Ignore non-outbound peers, or nodes marked for disconnect already
    3041+            if (!IsOutboundDisconnectionCandidate(pnode) || pnode->fDisconnect) return;
    3042+            CNodeState *state = State(pnode->GetId());
    3043+            if (state == nullptr) return; // shouldn't be possible, but just in case
    


    sipa commented at 6:19 am on November 2, 2017:
    Make assert?

    sdaftuar commented at 12:30 pm on November 2, 2017:

    I thought our preference was to not assert except in situations where node shutdown is necessary to avoid eg consensus failure or lost funds or something similarly bad? I’ll add the assert if that’s your preference, but I was just nervous about introducing a crash if I got some race condition wrong.

    I could add a LogPrinf() here instead perhaps?


    TheBlueMatt commented at 4:11 pm on November 2, 2017:
    Agree with @sdaftuar, not only should we not be asserting here, but I want to move away from net_processing relying on specific order-of-operations for deletion of CNode/FinalizeNode in net.

    sipa commented at 4:17 pm on November 2, 2017:

    @TheBlueMatt Perhaps asserting is too harsh, but silently ignoring the fact that an underlying assumption is broken seems bad too.

    Also, it seems very reasonable to assume that net does not invoke any net_processing operations before initializing a node, or after finalizing it? That seems easy to guarantee, and very hard for handlers to deal with if untrue.


    TheBlueMatt commented at 4:28 pm on November 2, 2017:
    @sipa If we moved mapNodeState outside of cs_main (which we should do) plus take #10697, then this would be a legitimate race that doesn’t matter - if FinalizeNode is called while this loop is running, we may hit this condition and evict our 8th peer because our 9th just disconnected, which may happen either way because the peer may disconnect a millisecond after we evict. This is a perfectly fine thing that isn’t a real issue, but re-introducing the strong assumption about when things are removed from vNodes vs when FinalizeNode is called is needless and annoying.

    sipa commented at 4:38 pm on November 2, 2017:
    Got it, I withdraw my comment.
  62. sipa commented at 6:25 am on November 2, 2017: member

    utACK 7d9415396b049dd1477febb71930a70870c5a541

    I did not review the test changes.

  63. Connect to an extra outbound peer if our tip is stale
    If our tip hasn't updated in a while, that may be because our peers are
    not relaying blocks to us that we would consider valid. Allow connection
    to an additional outbound peer in that circumstance.
    
    Also, periodically check to see if we are exceeding our target number of
    outbound peers, and disconnect the one which has least recently
    announced a new block to us (choosing the newest such peer in the case
    of tie).
    ac7b37cd2b
  64. Add CConnmanTest to mutate g_connman in tests 83df25736e
  65. Add unit test for stale tip checking 626291508c
  66. sdaftuar force-pushed on Nov 2, 2017
  67. sdaftuar commented at 4:42 pm on November 2, 2017: member

    Squashed https://github.com/sdaftuar/bitcoin/commits/11560.4 -> 626291508c433488439b662f2e88882048fb59fb

     0$ git diff 626291508c433488439b662f2e88882048fb59fb 7d9415396b049dd1477febb71930a70870c5a541
     1diff --git a/src/net_processing.h b/src/net_processing.h
     2index 0a49972eed..c1f1274f80 100644
     3--- a/src/net_processing.h
     4+++ b/src/net_processing.h
     5@@ -37,7 +37,7 @@ static constexpr int64_t MINIMUM_CONNECT_TIME = 30;
     6 
     7 class PeerLogicValidation : public CValidationInterface, public NetEventsInterface {
     8 private:
     9-    CConnman* const connman;
    10+    CConnman* connman;
    11 
    12 public:
    13     explicit PeerLogicValidation(CConnman* connman, CScheduler &scheduler);
    
  68. gmaxwell commented at 5:28 pm on November 2, 2017: contributor
    re-ACK.
  69. MarcoFalke referenced this in commit b0d0f87160 on Nov 2, 2017
  70. MarcoFalke referenced this in commit ddadae3c27 on Nov 2, 2017
  71. MarcoFalke referenced this in commit f8a4529793 on Nov 2, 2017
  72. MarcoFalke referenced this in commit cf93382d7e on Nov 2, 2017
  73. MarcoFalke referenced this in commit 8c8e512896 on Nov 2, 2017
  74. theuni commented at 6:52 pm on November 2, 2017: member
    utACK 626291508c433488439b662f2e88882048fb59fb
  75. laanwj merged this on Nov 2, 2017
  76. laanwj closed this on Nov 2, 2017

  77. laanwj referenced this in commit 2f959a5874 on Nov 2, 2017
  78. MarcoFalke referenced this in commit 49bf090185 on Nov 2, 2017
  79. MarcoFalke referenced this in commit 459f2db425 on Nov 2, 2017
  80. MarcoFalke referenced this in commit a607a95d81 on Nov 2, 2017
  81. MarcoFalke referenced this in commit 2ed0647ac1 on Nov 2, 2017
  82. MarcoFalke referenced this in commit fc308a6cdb on Nov 2, 2017
  83. MarcoFalke removed the label Needs backport on Nov 2, 2017
  84. laanwj removed this from the "Review priority 0.15.0.2" column in a project

  85. ghost commented at 1:20 am on February 24, 2018: none

    I had a slow sync issue which may have been related to and/or fixed by this… See #12514

    I eventually did get the recompile of the GitHub code but the version # did not change.

    I hope my issue/resolution helps in fixing this issue or confirming it is fixed :)

  86. codablock referenced this in commit f544e6a642 on Sep 26, 2019
  87. codablock referenced this in commit 58cb7e38f4 on Sep 29, 2019
  88. barrystyle referenced this in commit 0e69e92f08 on Jan 22, 2020
  89. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-19 00:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me