[addrman, net] Ensure tried collisions resolve, and allow feeler connections to existing outbound netgroups #15486

pull sdaftuar wants to merge 4 commits into bitcoin:master from sdaftuar:2019-02-addrman-collisions changing 3 files +22 −4
  1. sdaftuar commented at 8:29 pm on February 26, 2019: member

    The restriction on outbound peers sharing the same network group is not intended to apply to feeler connections, so fix this.

    This fixes an issue where a tried table collision with an entry to a netgroup we already have an outbound connection to could cause feelers to stop working, because the tried collision buffer (m_tried_collisions) would never be drained.

    Also, ensure that all entries don’t linger in m_tried_collisions by evicting an old entry if its collisions is unresolved after 40 minutes.

  2. [addrman] Improve tried table collision logging 4d834018e3
  3. sdaftuar commented at 8:31 pm on February 26, 2019: member
    I think this should resolve #15484, but it’s not clear to me if this is entirely sufficient to protect against the class of error, or if we should add additional logic to ResolveCollisions() to ensure that collisions get eventually resolved.
  4. DrahtBot commented at 9:37 pm on February 26, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    No conflicts as of last run.

  5. sdaftuar commented at 9:39 pm on February 26, 2019: member

    Updated with a commit that tries to fix the underlying problem in addrman, by doing eviction anyway after 20 minutes (to prevent m_tried_collisions from never being drained).

    If it looks good, I’ll update the OP of this PR before merge.

  6. fanquake added the label P2P on Feb 26, 2019
  7. gmaxwell commented at 10:43 pm on February 26, 2019: contributor
    This bugfix should probably get tagged for 0.18.
  8. fanquake added this to the milestone 0.18.0 on Feb 26, 2019
  9. [net] feeler connections can be made to outbound peers in same netgroup
    Fixes a bug where feelers could be stuck trying to resolve a collision in the
    tried table that is to an address in the same netgroup as an existing outbound peer.
    
    Thanks to Muoi Tran for the original bug report and detailed debug logs to track
    this down.
    4991e3c813
  10. sdaftuar force-pushed on Feb 27, 2019
  11. [addrman] Ensure collisions eventually get resolved
    After 40 minutes, time out a test-before-evict entry and just evict without
    testing. Otherwise, if we were unable to test an entry for some reason, we
    might break using feelers altogether.
    f71fdda3bc
  12. in src/addrman.h:170 in 1b0621b544 outdated
    165@@ -166,6 +166,9 @@ class CAddrInfo : public CAddress
    166 //! the maximum number of tried addr collisions to store
    167 #define ADDRMAN_SET_TRIED_COLLISION_SIZE 10
    168 
    169+//! the maximum time we'll spend trying to resolve a tried table collision, in seconds
    170+static const int64_t ADDRMAN_TEST_WINDOW = 20*60; // 20 minutes
    


    EthanHeilman commented at 7:13 pm on February 27, 2019:

    Unless I’m missing something this should be set to a value higher than 20 minutes.

    1. A feeler connection is made according to a Poisson distribution with an average of 2 minutes. 50% of the time the interval will take more than 2 minutes. Additionally this interval does not include the connection time. Consider that if each address in the buffer times out.
    2. The collision set buffer can hold at least 10 entries.

    Thus at 20 minutes it is very possible that a value might be added to the collision set buffer and not be tried for more than 20 minutes and then evicted. I would set this to at least 40 minutes. This won’t eliminate the chance an untried entry is evicted, but it will reduce it.

    Sadly I don’t have time to do the math, but if someone else has a few minutes, could someone compute the probability that a address would sit in the collision set for 40 minutes and not be tried. Assume that each prior attempt times out.


    sipa commented at 7:17 pm on February 27, 2019:
    I haven’t looked into this in detail, but wanted to point out that the chance that the time between 2 Poisson-distributed events has only a 1/e = 36.8% chance of being larger than the average interval.

    sipa commented at 7:26 pm on February 27, 2019:

    Oops, used the wrong arguments on Wolfram Alpha in the previous comment.

    Is the question what the probability is that the sum of 10 exponentially-distributed variables (independent and each with an average of 2 minutes) exceeds 20 minutes?

    https://www.wolframalpha.com/input/?i=P(X%3E20)+where+X~GammaDistribution(10,2) says 45.8%.

    Exceeding 40 minutes has a chance of 0.5%.


    sdaftuar commented at 7:46 pm on February 27, 2019:
    How likely is it that we add things to m_tried_collisions more often than once every 2 minutes or so?

    EthanHeilman commented at 8:39 pm on February 27, 2019:

    How likely is it that we add things to m_tried_collisions more often than once every 2 minutes or so?

    Not sure how to compute that probability but I can contrive a scenario where it happens:

    1. Assume a victim nodes tried table is full of good nodes, and the victim nodes new table is full of attacker nodes.

    2. Victim makes 8 outgoing connections and gets unlucky and chooses all 8 from new. This means the victim tries to move these 8 IPs in new to tried causing 8 collisions (remember we assumed tried is full). All 8 of these connections are being made to attacker nodes (since all new addresses are attack controlled).

    3. Attacker then breaks each of these 8 connections so the victim must reconnect. The victim chooses at least 2 nodes from new and causes two new collisions when it tries to add them to tried. Now there are 10 addresses in m_tried_collisions.


    sdaftuar commented at 9:56 pm on February 27, 2019:
    I’ve updated to 40 minutes, thanks.
  13. sdaftuar force-pushed on Feb 27, 2019
  14. sipa commented at 0:53 am on February 28, 2019: member
    utACK f71fdda3bc2e7acd2a8b74e882364866b8b0f55b
  15. in src/addrman.cpp:576 in f71fdda3bc outdated
    568@@ -567,6 +569,13 @@ void CAddrMan::ResolveCollisions_()
    569                         Good_(info_new, false, GetAdjustedTime());
    570                         erase_collision = true;
    571                     }
    572+                } else if (GetAdjustedTime() - info_new.nLastSuccess > ADDRMAN_TEST_WINDOW) {
    573+                    // If the collision hasn't resolved in some reasonable amount of time,
    574+                    // just evict the old entry -- we must not be able to
    575+                    // connect to it for some reason.
    576+                    LogPrint(BCLog::ADDRMAN, "Unable to test; swapping %s for %s in tried table anyway\n", info_new.ToString(), info_old.ToString());
    


    naumenkogs commented at 2:48 am on February 28, 2019:
    I’m not sure, but shouldn’t the variables be swapped?

    EthanHeilman commented at 3:23 pm on February 28, 2019:

    The log message is correct, but agree the log message doesn’t match what the code is doing. I think there is a bug in the code.

    The collision logic puts the entry in tried into m_tried_collisions and then feeler connections tests values in m_tried_collisions. Why is it using the info_new.nLastSuccess shouldn’t it be checking the last success of info_old in m_tried_collisions?

    That is, shouldn’t } else if (GetAdjustedTime() - info_new.nLastSuccess > ADDRMAN_TEST_WINDOW) { be } else if (GetAdjustedTime() - info_old.nLastSuccess > ADDRMAN_TEST_WINDOW) { instead or I am I missing something?


    EthanHeilman commented at 3:26 pm on February 28, 2019:

    Nevermind, I am wrong ignore my last message. The log message and the code is correct.

    } else if (GetAdjustedTime() - info_new.nLastSuccess > ADDRMAN_TEST_WINDOW) { is testing how long it has been since the collision was first found.


    Sjors commented at 6:33 pm on March 1, 2019:

    The word “swap” confused me, also in the log message a few lines above (if (GetAdjustedTime() - info_old.nLastTry > 60)). The phrase “replace [old] with [new]” is more clear.

    Also, this new behaviour contradicts the “when in doubt, don’t evict” rule I mentioned above. Though maybe I’m wrong there? It seems more consistent to just drop the entry.


    EthanHeilman commented at 7:17 pm on March 1, 2019:
    This new behavior does contradict “when in doubt, don’t evict” but only minimally because we give the IP-to-be-tested 40 minutes to be tested. If something prevents us from connecting to the IP-to-be-tested in that 40 minute window something is probably wrong with the IP-to-be-tested.

    sdaftuar commented at 9:02 pm on March 1, 2019:
    The idea is that if we’ve been trying to test the old entry and can’t for some reason, then something is probably wrong with that old entry, so we do want to do the replacement.

    sdaftuar commented at 9:09 pm on March 1, 2019:
    I can edit the language of the log message if this is confusing.
  16. Sjors commented at 4:18 pm on March 1, 2019: member

    Useful background info for context:

  17. in src/addrman.cpp:244 in f71fdda3bc outdated
    238@@ -239,7 +239,9 @@ void CAddrMan::Good_(const CService& addr, bool test_before_evict, int64_t nTime
    239 
    240     // Will moving this address into tried evict another entry?
    241     if (test_before_evict && (vvTried[tried_bucket][tried_bucket_pos] != -1)) {
    242-        LogPrint(BCLog::ADDRMAN, "Collision inserting element into tried table, moving %s to m_tried_collisions=%d\n", addr.ToString(), m_tried_collisions.size());
    243+        // Output the entry we'd be colliding with, for debugging purposes
    244+        auto colliding_entry = mapInfo.find(vvTried[tried_bucket][tried_bucket_pos]);
    245+        LogPrint(BCLog::ADDRMAN, "Collision inserting element into tried table (%s), moving %s to m_tried_collisions=%d\n", colliding_entry != mapInfo.end() ? colliding_entry->second.ToString() : "", addr.ToString(), m_tried_collisions.size());
    


    Sjors commented at 6:21 pm on March 1, 2019:

    when m_tried_collisions is full we don’t add the entry, so this log statement should be inside the if statement below.

    Background question: why is ok that we don’t add this collision to m_tried_collisions, yet also don’t call MakeTried? I assume it’ because when called with test_before_evict, we don’t evict when in doubt.


    EthanHeilman commented at 7:13 pm on March 1, 2019:
    Which line are you referring to with the statement about not adding the collision to m_tried_collisions?

    sdaftuar commented at 9:09 pm on March 1, 2019:

    I don’t think it’s quite as simple as “when in doubt, don’t evict”. I think in the case of a collision we bias ourselves towards not evicting, unless the peer we might evict can’t be connected to via feeler connection.

    My thought is that if m_tried_collisions is ever full, it seems like that is a scenario where we may be under attack; so not replacing things in our tried table (which we assume to have some good peers in it) is safe, even if not optimal. Maybe @EthanHeilman or @gmaxwell has a better intuition for how this works though.


    EthanHeilman commented at 9:37 pm on March 1, 2019:

    I agree with your intuition.

    We bias toward not evicting. The assumption is that if the tried table is contains many evil IPs we have already lost and won’t hear about good IPs. Thus, we assume the tried has many good IPs and that they should not be evicted unless a particular IP in tried is offline and thus no longer provides security.

  18. in src/net.cpp:1768 in f71fdda3bc outdated
    1764@@ -1765,9 +1765,15 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
    1765                 addr = addrman.Select(fFeeler);
    1766             }
    1767 
    1768+            // Require outbound connections to be to distinct network groups
    


    Sjors commented at 6:39 pm on March 1, 2019:
    0// Require outbound connections, other than feelers, to be to distinct network groups
    
  19. in src/net.cpp:1774 in f71fdda3bc outdated
    1770+                break;
    1771+            }
    1772+
    1773             // if we selected an invalid address, restart
    1774-            if (!addr.IsValid() || setConnected.count(addr.GetGroup()) || IsLocal(addr))
    1775+            if (!addr.IsValid() || IsLocal(addr)) {
    


    Sjors commented at 6:41 pm on March 1, 2019:
    0// if we selected an invalid or a local address, restart
    
  20. Sjors approved
  21. Sjors commented at 6:58 pm on March 1, 2019: member

    utACK f71fdda, but would be nice to have a test that demonstrates the original bug.

    As @sipa suggested a few years ago, let’s put a link to that paper in a comment somewhere.

  22. [addrman] Improve collision logging and address nits 20e6ea259b
  23. sdaftuar commented at 9:18 pm on March 1, 2019: member
    Addressed nits.
  24. sdaftuar renamed this:
    [net] Allow feeler connections to go to the same netgroup as existing outbound peers
    [addrman, net] Ensure tried collisions resolve, and allow feeler connections to existing outbound netgroups
    on Mar 1, 2019
  25. gmaxwell commented at 11:23 pm on March 1, 2019: contributor
    utACK
  26. Sjors commented at 10:48 am on March 2, 2019: member

    To clarify what I brought up inline: https://github.com/bitcoin/bitcoin/blob/20e6ea259b222b10f066f22695a5f56c52071f63/src/addrman.cpp#L240-L253

    The log statement says we’re moving it to m_tried_collisions, but that’s not true if that’s full. Hence my suggestion to move the log statement inside the if statement (line 245).

    For the case m_tried_collisions is full @sdaftuar wrote:

    I don’t think it’s quite as simple as “when in doubt, don’t evict”. I think in the case of a collision we bias ourselves towards not evicting, unless the peer we might evict can’t be connected to via feeler connection.

    My thought is that if m_tried_collisions is ever full, it seems like that is a scenario where we may be under attack; so not replacing things in our tried table (which we assume to have some good peers in it) is safe, even if not optimal. Maybe @EthanHeilman or @gmaxwell has a better intuition for how this works though.

    And @EthanHeilman replied:

    I agree with your intuition.

    We bias toward not evicting. The assumption is that if the tried table is contains many evil IPs we have already lost and won’t hear about good IPs. Thus, we assume the tried has many good IPs and that they should not be evicted unless a particular IP in tried is offline and thus no longer provides security.

    But a few lines down it seems we’re doing the opposite, namely bias ourselves towards evicting:

    https://github.com/bitcoin/bitcoin/blob/20e6ea259b222b10f066f22695a5f56c52071f63/src/addrman.cpp#L572-L579

    We don’t know anything about this new entry, because we couldn’t connect, so it seems to me it would be more consistent to keep the old entry.

    It’s still much better than getting stuck of course, so utACK 20e6ea2.

  27. gmaxwell commented at 2:14 pm on March 2, 2019: contributor

    We don’t know anything about this new entry,

    When we insert the new entry in tried in the first place, it’s because we’ve actually tried it. When inserting it we may find that it causes us to want to evict an old tried entry (which has also been tried, but not as recently). What the code is supposed to do (post try before evict) is to test the old tried entry before evicting it, in order to give it a second chance to stay around iff its still connectible.

  28. Sjors commented at 2:55 pm on March 2, 2019: member

    Ah, I think I get it now… since we don’t specifically track when things were added to m_tried_collisions, the check info_new.nLastSuccess > ADDRMAN_TEST_WINDOW is just the way we determine when the entry was added to the new table, which coincides with the time we last heard from it, i.e. when it last sent us a version message. And that’s also when it’s added to m_tried_collisions.

    If we receive a version message: https://github.com/bitcoin/bitcoin/blob/20e6ea259b222b10f066f22695a5f56c52071f63/src/net_processing.cpp#L1765

    From an outbound node: https://github.com/bitcoin/bitcoin/blob/20e6ea259b222b10f066f22695a5f56c52071f63/src/net_processing.cpp#L1885

    That’s when we call MarkAddressGood in net.cpp which in turn calls CAddrMan::Good_, which adds it to m_tried_collisions.

  29. sdaftuar commented at 4:39 pm on March 2, 2019: member

    Ah, I think I get it now… since we don’t specifically track when things were added to m_tried_collisions, the check info_new.nLastSuccess > ADDRMAN_TEST_WINDOW is just the way we determine when the entry was added to the new table

    That’s not necessarily when the entry was added to the new table – we can do that when learning about an address, for instance when relayed to us via an addr message from another peer. But otherwise I think your description is correct (in particular, that should be the time that we attempted to put it in the tried table but discovered the collision).

  30. laanwj added the label Needs backport on Mar 4, 2019
  31. naumenkogs commented at 8:25 pm on March 4, 2019: member
    utACK
  32. sdaftuar commented at 9:45 pm on March 6, 2019: member
    I think this is ready to go – anything left that I missed?
  33. laanwj commented at 6:12 am on March 9, 2019: member
    utACK 20e6ea259b222b10f066f22695a5f56c52071f63
  34. laanwj merged this on Mar 9, 2019
  35. laanwj closed this on Mar 9, 2019

  36. laanwj referenced this in commit ff38148808 on Mar 9, 2019
  37. fanquake referenced this in commit ceb5ab34b8 on Mar 9, 2019
  38. fanquake referenced this in commit 78725b4893 on Mar 9, 2019
  39. fanquake referenced this in commit 61248cd6da on Mar 9, 2019
  40. fanquake referenced this in commit d6602c2018 on Mar 9, 2019
  41. fanquake removed the label Needs backport on Mar 9, 2019
  42. laanwj referenced this in commit 561b00a02b on Mar 9, 2019
  43. laanwj referenced this in commit 487f0c3dc9 on Mar 9, 2019
  44. laanwj referenced this in commit 333be7aabf on Mar 9, 2019
  45. laanwj referenced this in commit b80dedb596 on Mar 9, 2019
  46. HashUnlimited referenced this in commit 83838f3488 on Mar 10, 2019
  47. HashUnlimited referenced this in commit e471ac89ba on Mar 10, 2019
  48. HashUnlimited referenced this in commit d8169ed1b0 on Mar 10, 2019
  49. HashUnlimited referenced this in commit 87350af46d on Mar 10, 2019
  50. deadalnix referenced this in commit 6411602e25 on Jun 18, 2020
  51. ftrader referenced this in commit a2c09f364c on Aug 17, 2020
  52. random-zebra referenced this in commit 7a538fde99 on Apr 8, 2021
  53. PastaPastaPasta referenced this in commit 731da4321f on Jun 25, 2021
  54. PastaPastaPasta referenced this in commit cf0280d196 on Jun 25, 2021
  55. PastaPastaPasta referenced this in commit 4e0a478bac on Jun 25, 2021
  56. PastaPastaPasta referenced this in commit fff8f97917 on Jun 26, 2021
  57. DrahtBot locked this on Dec 16, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-04 22:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me