No feeler connection is made when our tip is stale #15484

issue muoitranduc openend this issue on February 26, 2019
  1. muoitranduc commented at 1:40 pm on February 26, 2019: none

    I run several Bitcoin nodes on Amazon EC2 and notice some of them stop making feeler connections after a while. I checked the debug.log of the nodes and see that “Potential stale tip detected, will try using extra outbound peer” messages are printed out frequently, up to 10+ per day. The messages appear more frequently when the feeler connections disappear.

    I expect the nodes still make feeler connections in between events of the 9th outgoing connection made because of the stale tip. I presume SetTryNewOutboundPeer() incorrectly set the flags somewhere so that fFeeler is always false.

    The problem happens naturally for some nodes after running 20-30 days. All nodes are running Bitcoin core 0.17.0 by default non-stop on Ubuntu 18.04 LTS and with some minimal logging messages I added myself.
    I tried to restart one node and the problem is gone. I guess because outgoing peers have been chosen again when node restarts.

  2. fanquake added the label P2P on Feb 26, 2019
  3. sdaftuar commented at 1:51 pm on February 26, 2019: member
    Thanks for the bug report. Can you share a debug.log from a node affected by the issue? If there might be sensitive information in it (wallet-related logs, IP addresses, etc) I am mostly interested in the debug messages around the relevant networking code, consisting of lines that include: UpdateTip, Potential stale tip, disconnecting extra outbound peer, keeping outbound peer, setting try another outbound peer, and Making feeler connection.
  4. muoitranduc commented at 2:25 pm on February 26, 2019: none
    Thanks for the response. As I described in the issue, I only run the default client only with some additional logging messages. I didn’t turn on the debug=net, unfortunately. I was looking for how IPs are inserted into new and tried table, actually, hence the logging messages are quite different. My logging messages include almost all messages you requested actually, except “setting try another outbound peer”. Here is the link to the shortened debug.log: https://www.dropbox.com/s/769tyk1ftfpv3yh/shortened_debug.log?dl=0
  5. sdaftuar commented at 2:32 pm on February 26, 2019: member
    I don’t see any UpdateTip messages in there, could you please re-post with those included?
  6. muoitranduc commented at 2:35 pm on February 26, 2019: none
    I checked the original debug.log. There is no “UpdateTip” message. Do you expect to see log message record by the following line? https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L2263 If so, there is no such message in the original debug.log as well.
  7. MarcoFalke commented at 2:36 pm on February 26, 2019: member
    Could you please include the Bitcoin Core commit you built from and what local changes you made?
  8. sdaftuar commented at 2:53 pm on February 26, 2019: member
    I think whatever bug is going on here is occurring on my own node as well; thanks again for this report, I’ll take a look at my local logs and see what I can find.
  9. muoitranduc commented at 2:59 pm on February 26, 2019: none
    Great, thanks! FYI, I built from 0.17.0 release: https://github.com/bitcoin/bitcoin/archive/v0.17.0.zip I only added a few logging messages in addrman.cpp, net_processing.cpp and net.cpp.
  10. sdaftuar commented at 4:45 pm on February 26, 2019: member

    ping @EthanHeilman

    I think this is a buggy interaction between addrman and net relating to collisions and feelers. If we encounter a collision in the tried table with something in a network group that we already have an outbound connection to, then we can get stuck with an entry in m_tried_collisions that will never be resolved.

    In the debug.log you posted, I noticed that the feelers stopped regularly happening when the addrman encountered a collision in the tried table:

    02019-01-30T18:43:09Z ResolveCollisions_, 78.108.187.246:8333, Collision with 35.178.76.8:8333 in tried table at Bucket = 181 and Slot = 28
    

    At the time that this collision with 35.178.76.8 occurred, your node was already connected outbound to another node with the same network group:

    02019-01-27T02:17:23Z ProcessMessage, VERACK, New outgoing connection: peeraddr=35.178.95.164:8333 , number of incoming: 68 , number of outgoing: 8 
    

    Here’s the relevant code snippet from CConnMan::ThreadOpenConnections() that shows what happens:

    https://github.com/bitcoin/bitcoin/blob/master/src/net.cpp#L1755-L1801

    In particular, in our loop where we try to send a feeler connection to something that had a collision, we break if the address is to something in a network group we’re already connected to. However the collision cannot be resolved until we try to connect to the ip address and fail; here is the relevant code from the addrman’s ResolveCollisions():

    https://github.com/bitcoin/bitcoin/blob/master/src/addrman.cpp#L551-L571

    So we get stuck, with no way to try more feelers, and no way to connect to the one the feeler code would like to test.

    I’m not sure what the best fix is; I think we could either connect a feeler anyway to things in the same network group, or we could make it so that nothing stays in the m_tried_collisions set for too long before being resolved somehow, or both… @muoitranduc Your debug.log that shows both colliding entries was very helpful for tracking this down!

  11. EthanHeilman commented at 9:11 pm on February 26, 2019: contributor

    I think we could either connect a feeler anyway to things in the same network group, or we could make it so that nothing stays in the m_tried_collisions set for too long before being resolved somehow, or both…

    I agree with doing both fixes:

    1. Having it ignore groups is the right behavior.
    2. However we need a way to clear out an addr that is repeatedly being test and not being resolved.
  12. adamjonas commented at 12:13 pm on May 6, 2020: member
    Following up on whether #15486 resolved this issue or if additional logic to ResolveCollisions() is needed as per @sdaftuar’s comment. cc @muoitranduc
  13. deadalnix referenced this in commit 6411602e25 on Jun 18, 2020
  14. ftrader referenced this in commit a2c09f364c on Aug 17, 2020
  15. adamjonas commented at 2:27 pm on October 20, 2020: member

    From email with @muoitranduc:

    “…we couldn’t reproduce the corner cases that we faced with our newly-spawned nodes. I understand that the logic has been modified even further with the anchor connections. Hence, I guess it is safe to close the issue because its only purpose is to guarantee feeler connections can be established.”

    This looks like it can be closed.

  16. fanquake closed this on Oct 20, 2020

  17. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-04 22:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me