p2p: Request NOTFOUND transactions immediately from other outbound peers, when possible #15505

Pull request: sdaftuar wants to merge 4 commits into bitcoin:master from sdaftuar:2019-02-notfound-requests, changing 5 files (+123 −44)
  1. sdaftuar commented at 8:19 pm on February 28, 2019: member

    Currently, when a peer asks for a transaction that we cannot (or will not) deliver, we send a NOTFOUND message for the given txid. However, our software ignores the NOTFOUND messages we receive from others – if other peers have announced a transaction, we wait until the transaction request times out before requesting it from our next peer.

    Improve this by immediately requesting the transaction from one of the outbound peers who have recently announced it (if any).
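
    A minimal, self-contained sketch of the idea (hypothetical stand-in types; the real patch works with CNodeState/TxDownloadState and g_outbound_peers in net_processing.cpp):

        #include <cstdint>
        #include <iostream>
        #include <map>
        #include <set>
        #include <vector>

        using NodeId = int64_t;
        using TxId = uint64_t; // stand-in for uint256

        struct PeerState {
            std::set<TxId> announced;   // txids this peer has announced via INV
            std::vector<TxId> notfound; // txids queued for immediate re-request
        };

        std::map<NodeId, PeerState> g_peers; // hypothetical global peer table
        std::set<NodeId> g_outbound;         // outbound peer ids

        // On NOTFOUND from `from` for `txid`: signal every *other* outbound peer
        // that announced the txid, so whichever of them is able to make the
        // request does so promptly instead of waiting for the download timeout.
        void OnNotFound(NodeId from, TxId txid)
        {
            g_peers[from].announced.erase(txid);
            for (NodeId id : g_outbound) {
                if (id == from) continue;
                PeerState& peer = g_peers[id];
                if (peer.announced.count(txid)) peer.notfound.push_back(txid);
            }
        }

        int main()
        {
            g_peers[0].announced = {7};
            g_peers[1].announced = {7};
            g_outbound = {0, 1};
            OnNotFound(0, 7); // peer 0 said NOTFOUND for txid 7
            std::cout << g_peers[1].notfound.size() << '\n'; // prints 1
        }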

  2. DrahtBot commented at 9:31 pm on February 28, 2019: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #16279 (Return the AcceptBlock CValidationState directly in ProcessNewBlock by TheBlueMatt)
    • #16197 (net: Use mockable time for tx download by MarcoFalke)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. in src/net_processing.cpp:853 in 059e04c03c outdated
    849+        if (peer_loc != g_outbound_peers.size()-1) {
    850+            // Swap this peer with the last one
    851+            g_outbound_peers[peer_loc] = g_outbound_peers.back();
    852+        }
    853+        g_outbound_peers.pop_back();
    854+    }
    


    jamesob commented at 10:31 pm on February 28, 2019:
    Seems like this could be a lot more succinct with vector::erase - is there some reason to do it this way?

    sdaftuar commented at 3:31 pm on March 1, 2019:

    I guess I have an aversion to vector::erase because it requires moving all the other entries. Possibly that concern doesn’t make sense for such a small vector?

    But the bigger issue here is why this is a vector at all – originally I had this as a set, but then I changed it to be a vector in order to make the logic around when to drain the notfound queue for each peer make sense. In short, I wanted to assign each notfound transaction uniformly at random to one of the outbound peers that announced it, but I didn’t have a good way to do that; so instead I assign it to all such peers and keep a (hacky) counter in SendMessages, which each outbound peer checks to decide when to drain its notfound queue.

    (I’m hoping that reviewers will give me a better idea of how to implement that logic than what I’ve got so far…)
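
    For reference, the swap-and-pop idiom under discussion, as a compilable sketch (the helper name is hypothetical): it erases in O(1) by moving the last element into the hole, at the cost of element order, whereas std::vector::erase preserves order but shifts every later element left.

        #include <cassert>
        #include <cstddef>
        #include <utility>
        #include <vector>

        // O(1) erase that does not preserve element order.
        template <typename T>
        void UnorderedErase(std::vector<T>& v, std::size_t idx)
        {
            assert(idx < v.size());
            if (idx != v.size() - 1) v[idx] = std::move(v.back());
            v.pop_back();
        }

        int main()
        {
            std::vector<int> v{10, 20, 30, 40};
            UnorderedErase(v, 1); // removes 20; 40 moves into its slot
            assert(v == (std::vector<int>{10, 40, 30}));
        }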

  4. in src/net_processing.cpp:356 in 0f4ceef389 outdated
    351@@ -338,6 +352,11 @@ struct CNodeState {
    352      *   peers.
    353      */
    354     struct TxDownloadState {
    355+        /* Notifications of NOTFOUND transactions (from other peers) which
    356+         * this peer has recently announced. txid->time in micros
    


    naumenkogs commented at 11:18 pm on February 28, 2019:
    This comment seems outdated

    sdaftuar commented at 8:21 pm on March 1, 2019:
    Fixed.
  5. fanquake added the label P2P on Feb 28, 2019
  6. in src/net_processing.cpp:3186 in 0f4ceef389 outdated
    3183+                    // waiting for the download timeout to expire before
    3184+                    // re-requesting a transaction.
    3185+                    if (!AlreadyHave(inv)) {
    3186+                        for (auto it = g_outbound_peers.begin(); it != g_outbound_peers.end(); ++it) {
    3187+                            if (*it == pfrom->GetId()) continue;
    3188+                            CNodeState *state = State(*it);
    


    practicalswift commented at 7:29 am on March 1, 2019:
    This state shadows the state in the outer scope.

    sdaftuar commented at 8:22 pm on March 1, 2019:
    Fixed.
  7. in src/net_processing.cpp:748 in 0f4ceef389 outdated
    743@@ -724,6 +744,41 @@ void RequestTx(CNodeState* state, const uint256& txid, int64_t nNow) EXCLUSIVE_L
    744     peer_download_state.m_tx_process_time.emplace(process_time, txid);
    745 }
    746 
    747+// Given a txid and a peer that we'd like to consider downloading a transaction
    748+// from: add an appropriate INV nessage to get_data if it is time to request
    


    practicalswift commented at 7:31 am on March 1, 2019:
    I’ve got an important nessage for you: this should be “message”.

    sdaftuar commented at 8:22 pm on March 1, 2019:
    Fixed.
  8. in src/net_processing.cpp:751 in 0f4ceef389 outdated
    743@@ -724,6 +744,41 @@ void RequestTx(CNodeState* state, const uint256& txid, int64_t nNow) EXCLUSIVE_L
    744     peer_download_state.m_tx_process_time.emplace(process_time, txid);
    745 }
    746 
    747+// Given a txid and a peer that we'd like to consider downloading a transaction
    748+// from: add an appropriate INV nessage to get_data if it is time to request
    749+// the transaction from the peer, and update the appropriate tx download state
    750+// (both the global state and the peer's state). Send a getdata message if the
    751+// get_data vector grows too large.
    


    jamesob commented at 3:09 pm on March 1, 2019:

    Nice doc. To get doxygen to pick this up, you’ll have to do

    //! this
    /// or this
    /**
     * or this
     */
    

    sdaftuar commented at 8:22 pm on March 1, 2019:
    Done.
  9. in src/net_processing.cpp:3242 in 0f4ceef389 outdated
    3163+        CNodeState *state = State(pfrom->GetId());
    3164+        std::vector<CInv> vInv;
    3165+        vRecv >> vInv;
    3166+        if (vInv.size() <= MAX_INV_SZ) {
    3167+            for (CInv &inv : vInv) {
    3168+                if (inv.type == MSG_TX || inv.type == MSG_WITNESS_TX) {
    


    jamesob commented at 3:19 pm on March 1, 2019:

    This is just a style/readability thing, but you can avoid indenting the next ~23 lines by changing this to

    if (inv.type != MSG_TX && inv.type != MSG_WITNESS_TX) {
        continue;
    }
    

    sdaftuar commented at 8:22 pm on March 1, 2019:
    Will leave as-is – in this case I like being able to tell that the code block exactly matches the if condition, rather than matching the opposite of an if condition above it that would have caused the loop iteration to be skipped.
  10. in src/net_processing.cpp:3234 in 0f4ceef389 outdated
    3155@@ -3081,8 +3156,46 @@ bool static ProcessMessage(CNode* pfrom, const std::string& strCommand, CDataStr
    3156     }
    3157 
    3158     if (strCommand == NetMsgType::NOTFOUND) {
    


    jamesob commented at 3:29 pm on March 1, 2019:
    Aid for fellow reviewers: NOTFOUND messages are assembled here and their structure exactly resembles INVs, which is why we can deserialize using CInv below.
  11. sdaftuar commented at 3:32 pm on March 1, 2019: member
    Thanks for the nits; I’ll address them later, but in the meantime I’m open to suggestions on how to better implement the random assignment of notfound transactions to the outbound peers that have announced them (I think the logic I have here should work, but it’s a bit gross).
  12. in src/net_processing.cpp:3246 in 0f4ceef389 outdated
    3166+        if (vInv.size() <= MAX_INV_SZ) {
    3167+            for (CInv &inv : vInv) {
    3168+                if (inv.type == MSG_TX || inv.type == MSG_WITNESS_TX) {
    3169+                    state->m_tx_download.m_tx_announced.erase(inv.hash);
    3170+                    auto in_flight_it = state->m_tx_download.m_tx_in_flight.find(inv.hash);
    3171+                    if (in_flight_it == state->m_tx_download.m_tx_in_flight.end()) {
    


    jamesob commented at 3:55 pm on March 1, 2019:

    IMO more clearly written as

    if (!state->m_tx_download.m_tx_in_flight.count(inv.hash)) {
        ...
    }
    

    sdaftuar commented at 8:23 pm on March 1, 2019:
    Oops, I meant to use the iterator in the erase() call, saving an extra map lookup. Just did that in the latest commit.
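
    The single-lookup idiom in question, sketched on a plain std::map (names hypothetical): reuse the iterator returned by find() in erase(), rather than a separate count()/erase(key) pair that searches the map twice.

        #include <cstdint>
        #include <iostream>
        #include <map>

        int main()
        {
            std::map<uint64_t, int> tx_in_flight{{42, 1}};
            auto it = tx_in_flight.find(42); // one lookup...
            if (it != tx_in_flight.end()) {
                // ... process the in-flight entry ...
                tx_in_flight.erase(it);      // ...reused here; no second search
            }
            std::cout << tx_in_flight.size() << '\n'; // prints 0
        }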
  13. in src/net_processing.cpp:4010 in 0f4ceef389 outdated
    4005+            // we might impose).
    4006+            // Our algorithm is to keep a global SendMessages() counter and
    4007+            // every outbound peer gets to drain its notfound queue when the
    4008+            // counter matches its location.
    4009+            assert(g_outbound_peers.size() > 0);
    4010+            if (g_outbound_peers[not_found_counter % g_outbound_peers.size()] == pto->GetId()) {
    


    jamesob commented at 4:15 pm on March 1, 2019:

    Correct me if I’m wrong here, but isn’t this equivalent to doing something like

    if (GetRandInt(/* nMax= */g_outbound_peers.size()) == 0) {
        ...
    }
    

    ?

    If you think the potential performance difference (which IMO is probably negligible when shoulder-to-shoulder with network IO) is okay, just doing something like that would allow you to keep g_outbound_peers a set and preserve the algorithm.


    sdaftuar commented at 4:55 pm on March 1, 2019:
    Doing a 1 in N draw each time would introduce a bias toward peers that are earlier in the loop, versus ones that are later in the loop. Imagine if N=2, and we have two peers, peer0 and peer1. peer0 would be selected with probability 1/2 + 1/8 + 1/32 + … = 2/3. If N=8, I think the first peer would have a 19% chance of being selected.
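
    Working those numbers out (a sketch, assuming peers are tried in a fixed order each loop, each trial succeeding independently with probability p = 1/N, repeating until some peer succeeds):

        \[
          P(\text{peer}_0 \text{ selected}) \;=\; \sum_{k \ge 0} (1-p)^{Nk}\,p
          \;=\; \frac{p}{1-(1-p)^{N}}
        \]
        \[
          N=2:\ \frac{1/2}{1-(1/2)^{2}} = \frac{1}{2} + \frac{1}{8} + \frac{1}{32} + \cdots = \frac{2}{3},
          \qquad
          N=8:\ \frac{1/8}{1-(7/8)^{8}} \approx 0.19
        \]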

    sdaftuar commented at 5:03 pm on March 1, 2019:
    I just realized that my hacky implementation here is broken too – the not_found_counter will basically increment by the number of peers we have, so if the number of peers we have is equal to the number of outbound peers we have (a common state for non-listening nodes), then we’ll always be requesting notfound’s from the first peer.

    jamesob commented at 5:46 pm on March 1, 2019:

    I’m not sure I see how there’s any bias based on the ordering of peers - the GetRandInt thing is an independent trial across peers, though of course we may end up draining multiple peers’ m_not_found per single ThreadMessageHandler loop if two peers both get 1/num_outbounds lucky. Is there a reason we’d want to avoid that?

    It seems like there might be a bias based upon which peer we select to queue the m_not_found message on (peers earlier in g_outbound_peers will get selected first more often), but I’m not sure how the order in which we flush m_not_found makes a difference.


    sdaftuar commented at 8:23 pm on March 1, 2019:
    Ok I redid this to be (hopefully) better, thanks.
  14. jamesob commented at 4:17 pm on March 1, 2019: member
    Feel free to ignore the nits, but curious what you think about the use of GetRandInt in lieu of the vector hackery.
  15. sdaftuar force-pushed on Mar 1, 2019
  16. sdaftuar commented at 8:25 pm on March 1, 2019: member
    @jamesob Thanks for the helpful suggestions. I rewrote this with g_outbound_peers being a set, and moving the global sequence number needed for peer selection to be in the interface between net and net_processing. (It seems to me like the idea of having synchronization between loop iterations in ThreadMessageHandler is something that is occasionally useful in net_processing, so I hope the interface change there is acceptable.)
  17. sdaftuar force-pushed on Mar 1, 2019
  18. jamesob commented at 9:35 pm on March 1, 2019: member
    Looks good! I’ll write up a related functional test this weekend unless you have a burning desire to do so.
  19. naumenkogs commented at 9:56 pm on March 1, 2019: member

    I must be misunderstanding something.

    Imagine we have 2 peers: [p1, p2]. Start with seq_num = 1.

    SendMessages(p1, 1 % 2 = 1): false
    SendMessages(p2, 1 % 2 = 1): false

    SendMessages(p1, 2 % 2 = 0): true
    SendMessages(p2, 2 % 2 = 0): true

    So, p1 will always be called before p2 (and we’re much more likely to request something from p1)?

  20. jamesob commented at 10:02 pm on March 1, 2019: member
    @naumenkogs it’s not the result of the modulus itself that determines whether to flush notfound for a given peer; sequence_number % num_outbounds is used as an index into the set of outbound peers, which determines the peer whose turn it is to flush.
  21. sdaftuar commented at 10:04 pm on March 1, 2019: member

    Let’s say we have 2 peers, p1 and p2 in g_outbound_peers, and let’s assume the std::set comparator puts them in that order when iterating through g_outbound_peers. Let’s also say both of them have something in their respective notfound queues.

    When the sequence number is 1, then 1 % 2 == 1, so we look at g_outbound_peers.begin() + 1, and we get p2’s nodeid. When we call SendMessages on p1, the nodeid will not match, so its queue will not be drained. When we call SendMessages on p2, the condition will be true, and we’ll drain the notfound queue. On the next loop iteration, the conditions flip.

    (I’m open to simpler logic that accomplishes the same thing!)
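
    The described selection, as a compilable sketch (the helper name is hypothetical; in the PR the check lives inline in SendMessages):

        #include <cstdint>
        #include <iostream>
        #include <iterator>
        #include <set>

        using NodeId = int64_t;

        // The sequence number advances once per ThreadMessageHandler loop, so
        // exactly one outbound peer drains its notfound queue per loop,
        // rotating through the set.
        bool MyTurnToDrain(const std::set<NodeId>& outbound, uint64_t seq, NodeId me)
        {
            if (outbound.empty()) return false;
            auto it = outbound.begin();
            std::advance(it, seq % outbound.size());
            return *it == me;
        }

        int main()
        {
            const std::set<NodeId> outbound{1, 2}; // p1, p2
            for (uint64_t seq = 1; seq <= 4; ++seq) {
                std::cout << "seq=" << seq
                          << " p1=" << MyTurnToDrain(outbound, seq, 1)
                          << " p2=" << MyTurnToDrain(outbound, seq, 2) << '\n';
            }
            // seq=1: p1=0 p2=1; seq=2: p1=1 p2=0; the turns alternate as above.
        }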

  22. naumenkogs commented at 10:07 pm on March 1, 2019: member

    @sdaftuar @jamesob I see, you’re right, thanks.

    I was not a big fan of the extra sequence_number, but based on the discussion above and some thinking, I don’t see a cleaner way to select a peer fairly. This sequence_number is also likely to be useful somewhere else in the future.

    utACK

  23. jamesob commented at 7:41 pm on March 5, 2019: member

    utACK https://github.com/bitcoin/bitcoin/pull/15505/commits/f5dbb49c8922cb26d05862b7aacf161bde8ffc5a

    I was going to write a functional test for this, but I’m not sure there’s a sensible way to do that given that it looks like we can’t force P2PDataStore clients to be outbound connections for nodes under test. (see also: #14210)

    If anyone can think of a good manual test plan, I’m happy to carry it out.

  24. DrahtBot added the label Needs rebase on May 4, 2019
  25. sdaftuar commented at 3:16 pm on June 5, 2019: member
    Closing for now.
  26. sdaftuar closed this on Jun 5, 2019

  27. sdaftuar reopened this on Jun 12, 2019

  28. [net] Add a global sequence number to SendMessages()
    Keep an incrementing counter between loop iterations in ThreadMessageHandler.
    
    This allows for synchronization between loop iterations in SendMessages.
    88d73ab65c
  29. [p2p] Maintain a set of outbound peers 7249cad634
  30. Refactor transaction getdata request logic b8ca596138
  31. Request NOTFOUND transactions immediately
    Thanks to James O'Beirne for some of the code in this patch.
    753883512a
  32. sdaftuar force-pushed on Jun 16, 2019
  33. DrahtBot removed the label Needs rebase on Jun 18, 2019
  34. sdaftuar commented at 2:28 pm on July 22, 2019: member
    Seems like not much interest here, closing for now.
  35. sdaftuar closed this on Jul 22, 2019

  36. jkczyz commented at 11:41 pm on July 30, 2019: contributor

    Rather than using a sequence number to determine which peer’s notfound queue to drain, would the following work?

    1. When a notfound message is received, add each transaction to the notfound queue of a random peer selected from those that have announced the transaction.
    2. Unconditionally drain each peer’s notfound queue.

    This would avoid the added complexity of a sequence number. It would also ensure that each transaction is requested from only one peer.
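
    A minimal sketch of this alternative, with hypothetical container names (step 1; step 2 would simply drain every peer’s queue unconditionally in SendMessages):

        #include <cstddef>
        #include <cstdint>
        #include <map>
        #include <random>
        #include <set>
        #include <vector>

        using NodeId = int64_t;
        using TxId = uint64_t;

        std::map<NodeId, std::set<TxId>> g_announced;         // per-peer announced txids
        std::map<NodeId, std::vector<TxId>> g_notfound_queue; // drained unconditionally

        // On NOTFOUND for `txid` from `from`: queue the txid on one random
        // outbound peer among those that announced it.
        void AssignNotFound(TxId txid, NodeId from, const std::set<NodeId>& outbound)
        {
            std::vector<NodeId> candidates;
            for (NodeId id : outbound) {
                if (id != from && g_announced[id].count(txid)) candidates.push_back(id);
            }
            if (candidates.empty()) return;
            static std::mt19937_64 rng{std::random_device{}()};
            std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
            g_notfound_queue[candidates[dist(rng)]].push_back(txid);
        }

        int main()
        {
            g_announced[1] = {7};
            g_announced[2] = {7};
            AssignNotFound(7, /*from=*/0, {1, 2}); // ends up on peer 1 or peer 2
        }

    (sdaftuar’s reply below explains why the PR instead signals all announcing peers: there is no good way to know in advance whether a given peer will actually be able to send the getdata.)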

  37. sdaftuar reopened this on Oct 29, 2019

  38. fanquake renamed this:
    [p2p] Request NOTFOUND transactions immediately from other outbound peers, when possible
    p2p: Request NOTFOUND transactions immediately from other outbound peers, when possible
    on Oct 29, 2019
  39. DrahtBot commented at 7:35 pm on October 29, 2019: contributor
  40. DrahtBot added the label Needs rebase on Oct 29, 2019
  41. jnewbery commented at 8:13 pm on November 13, 2019: contributor
    @sdaftuar what do you think about @jkczyz’s suggestion for randomly selecting which peer to request the tx from? Conceptually, I think I prefer it to the sequence counter in the net/net_processing interface.
  42. sdaftuar commented at 8:28 pm on November 14, 2019: member

    Rather than using a sequence number to determine which peer’s notfound queue to drain, would the following work?

    1. When a notfound message is received, add each transaction to the notfound queue of a random peer selected from those that have announced the transaction.
    2. Unconditionally drain each peer’s notfound queue.

    The reason I tend to avoid an approach like this is that the logic for whether we actually send a message to a peer is very complex; there are many reasons why we might call SendMessages() on a peer and then ultimately not be able to send a getdata to it (maybe we have too many getdatas in flight to them, or maybe we’re processing block requests for them, or maybe their send buffer is full, blocking communication for a while, etc.). Since there’s no good way to tell whether a getdata would be sent if assigned to a given peer, I think it’s more robust to just signal all of them and let each peer individually figure out if it can make the request.

  43. maflcko commented at 7:39 pm on November 15, 2019: member
    @sdaftuar Are you still working on this to get it rebased, or would you prefer someone to take it over?
  44. sdaftuar commented at 8:07 pm on November 15, 2019: member
    Please consider this up for grabs.
  45. fanquake added the label Up for grabs on Nov 15, 2019
  46. fanquake commented at 3:30 am on January 5, 2020: member
    Going to close. @naumenkogs you might be interested in following up here?
  47. fanquake closed this on Jan 5, 2020

  48. bitcoin locked this on Feb 15, 2022
  49. sipa commented at 10:08 am on May 2, 2023: member
    I believe this PR’s goal has now effectively been solved by #19988.
  50. maflcko removed the label Up for grabs on May 2, 2023
  51. maflcko removed the label Needs rebase on May 2, 2023
  52. bitcoin unlocked this on May 2, 2023
  53. bitcoin locked this on May 1, 2024
