Bugfix: don't make collision from "tried" a feeler #19906

naumenkogs commented at 10:41 AM on September 7, 2020: member

The purpose of feelers is to move addrs from "new" to "tried." We should not select a node evicted from collisions, as they are already "tried".

Skip selecting collisions in feeler iterations

If we're going to connect to a feeler (as decided by a timer),
we should not "waste" this iteration on connecting to a collision.
The collisions belong to the "tried" table, so connecting to them
don't accomplish the goal of feelers.

a0d8cff498

Refactor code around feeler connection 0c3e80db42

fanquake added the label P2P on Sep 7, 2020

naumenkogs closed this on Sep 7, 2020

naumenkogs reopened this on Sep 7, 2020

sdaftuar commented at 12:05 AM on September 11, 2020: member

Can you explain a bit more what this PR is trying to accomplish? Is this just about naming conventions, or something else?

naumenkogs commented at 4:11 AM on September 11, 2020: member

@sdaftuar does the first commit message not sound clear?

If we're going to connect to a feeler (as decided by a timer),
we should not "waste" this iteration on connecting to a collision.
The collisions belong to the "tried" table, so connecting to them
don't accomplish the goal of feelers.

We are supposed to attempt a feeler "transition" every 2 minutes. But instead, when a timer hits, we might connect to a tried address (from tried collisions), instead of a feeler. And the next feeler will be attempted after 2 more minutes.

I don't think it's exploitable: an attacker can't force us to do this all the time. It's just currently logically incorrect.

sdaftuar commented at 4:38 PM on September 11, 2020: member

Got it, thanks for clarifying -- I didn't realize that the concern was around the frequency that we pull things out of the new table to potentially move them to the tried table.

I think the underlying question here is how often we want to be making outbound connections to the network (this is a question I was thinking about in the context of #19858 as well). Both uses of FEELER connections today (moving things from new to tried, as well as the test-before-evict logic of resolving tried-table collisions) have the same property that their connections last a very short time. So I think that we could ask ourselves, given a certain rate at which we're comfortable making new outbound connections, how many of those should go to the test-before-evict logic, and how many to the logic that is pulling entries from the new-table?

In practice, it seems to me that tried-table collisions are relatively infrequent, probably less than a few times per day on nodes I've looked at, if I remember right. So the effect of this on the new-table FEELER logic is low. On the other hand, if it were to have a big effect somehow, then changing the logic so that we still make feeler connections to new-table peers at the once-every-2-minute rate will mean we're connecting to the network much more (assuming that we always make the test-before-evict connections when we need to).

So my inclination is that we don't need this change? Or if we do want it, we should just put a rate limit in place so that we can reason about what bounds are achievable in the event of tried-table-collisions.

BTW, the logic presented in this PR is incorrect, I think, because it only tries to resolve tried-table-collisions when we'd make a non-FEELER connection. However, once we have all our outbound peers in place, we never try to make new non-FEELER connections (well, at least until #19858), so I think this basically breaks the test-before-evict logic as drafted now.

in src/net.cpp:1935 in 0c3e80db42

1940 | +            // skip selecting collisions.
1941 | +            // Also require non-feelers to be distinct network groups to preserve connection diversity.
1942 | +            // It's much less relevant for feelers, because it's a frequent procedure
1943 | +            // to sanitize AddrMan (not critical), and selecting from AddrMan is
1944 | +            // sufficiently (for this case) random/difficult-to-Sybil due to the bucketing.
1945 | +            if (conn_type != ConnectionType::FEELER) {

sdaftuar commented at 4:41 PM on September 11, 2020:

It's rare that we get to this line of code -- on startup, when we have not enough outbound peers, we'll hit this. But once we have made our initial 10 outbound connections, then at line 1916 above we'll abort the loop iteration, never getting to this code.

So what would happen is that our FEELERS generate tried-table-collisions which never get resolved, eventually resulting in the old tried table entry getting evicted without an attempt having been made, breaking test-before-evict.

sdaftuar commented at 4:43 PM on September 11, 2020:

Also, making the test-before-evict connections no longer be FEELER breaks the logic in net_processing around disconnecting these peers right when we connect.

in src/net.cpp:1940 in 0c3e80db42

1945 | +            if (conn_type != ConnectionType::FEELER) {
1946 | +                // SelectTriedCollision returns an invalid address if it is empty.
1947 | +                addr = addrman.SelectTriedCollision();
1948 | +                if (!addr.IsValid()) {
1949 | +                    // If no collisions, consider an address from the "tried" table.
1950 | +                    addr = addrman.Select(true);

sdaftuar commented at 4:42 PM on September 11, 2020:

I think true means pull from the new table, not the tried table, so this line and the one below at line 1946 are backwards.

sdaftuar changes_requested

naumenkogs commented at 9:41 AM on September 14, 2020: member

We currently have the following documentation:

    /**
     * Feeler connections are short lived connections used to increase the
     * number of connectable addresses in our AddrMan. Approximately every
     * FEELER_INTERVAL, we attempt to connect to a random address from the new
     * table. If successful, we add it to the tried table.
     */
    FEELER,

It's not accurate since test-before-evict is not about new->tried, and it's not mentioned here. I'm pretty sure it may be inaccurate in other places as well, I didn't even know that this is done via feelers before you told me so here.

Considering what's above, this PR attempted (not so well yet) to separate these two uses, and make feelers do new->tried exclusively.

Now that I know that this other case was actually intended (at least by the author), there are 2 paths we may take:

update the documentation to mention that case as well (+ maybe don't reset feeler timer when resolve collisions, related to #19869)
OR have a new conn type @sdaftuar what do you think? Perhaps I'll sketch what would the latter option look like in the code later.

sdaftuar commented at 1:19 PM on September 14, 2020: member

Now that I know that this other case was actually intended (at least by the author), there are 2 paths we may take:

update the documentation to mention that case as well (+ maybe don't reset feeler timer when resolve collisions, related to #19869)

OR have a new conn type

I agree that it's confusing that the term "feeler" has encompassed both behaviors (perhaps @EthanHeilman has a view as well?) In Eclipse Attacks on Bitcoin’s Peer-to-Peer Network, the "feeler" and "test-before-evict" countermeasures do not seem to use overlapping terminology, so my suggestion is that our code be clearer about mapping to those countermeasure names as well.

My suggestion would be that we rename ConnectionType::FEELER to something like ConnectionType::ADDRMAN_HELPER or ConnectionType::TEST_CONNECT or something to indicate either that the connection is short-lived, or just for the addrman, or both. And then document in our code that we use this brief-addrman-helper conn_type for both the feeler and test-before-evict peer selection logic.

I think we could also rename the feeler timer to be the "addrman-helper-timer", and differentiate the two behaviors in code comments better, if that would make the logic clearer.

I don't think it makes sense to add a second conn_type that has exactly the same properties as the first, just with a different name. Perhaps if we thought those behaviors might diverge in the future, it might make sense; but I don't anticipate that myself.

naumenkogs commented at 8:05 AM on September 15, 2020: member

closing in favour of #19958

naumenkogs closed this on Sep 15, 2020

DrahtBot locked this on Feb 15, 2022